Machine Learning in R
Suja A. Alex,
Assistant Professor,
Dept. of Information Technology,
St.Xavier’s Catholic College of Engineering
Data Science
• Multidisciplinary field
• Data Science uses computer
science, statistics, machine
learning, and visualization to
collect, clean, integrate, analyze,
and visualize data, and to interact
with it to create data products.
• Data science principles apply to
all data – big and small
5 Vs of Big Data:
• Raw Data: Volume
• Change over time: Velocity
• Data types: Variety
• Data Quality: Veracity
• Information for Decision Making: Value
Machine Learning Algorithms:
Input: Datasets in R
• https://vincentarelbundock.github.io/Rdatasets/datasets.html
• http://archive.ics.uci.edu/ml/datasets.php
• https://www.kaggle.com/datasets
Output: Data Visualization Packages in R
• graphics - plot(), barplot(), boxplot()
• ggplot2 - Scatterplot
• lattice - tiled plots
• plotly - Line plot, Time series chart, interactive 3D plots
1. Cluster Analysis
• Finding groups of objects
• Objects in a group will be similar (or related) to one another and
different from (or unrelated to) the objects in other groups.
Goal: Maximize similarity within clusters; minimize similarity between clusters
Taxonomy of Clustering Algorithms
K-means clustering - Example
Data: S={2,3,4,10,11,12,20,25,30}
If we choose K=2
Find first set of Means (choose randomly):
M1=4, M2=12
Assign elements to two clusters K1 and K2:
K1={2,3,4} K2={10,11,12,20,25,30}
Find second set of Means:
M1=(2+3+4)/3=3 M2=(108/6)=18
Re-assign elements to two clusters K1 and K2:
K1={2,3,4,10} K2={11,12,20,25,30}
Now M1=(19/4)=4.75 ≈ 5 M2=(98/5)=19.6 ≈ 20
K1={2,3,4,10,11,12} K2={20,25,30}
M1=7 M2=25
K1={2,3,4,10,11,12} K2={20,25,30}
M1=7 M2=25
When the means no longer change, the k-means algorithm stops: these are the final two clusters.
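The worked example above can be reproduced with R's built-in kmeans(), seeding it with the initial means M1=4 and M2=12 instead of a random start:

```r
# Reproduce the hand-worked example with kmeans(),
# passing the initial centers explicitly.
S <- c(2, 3, 4, 10, 11, 12, 20, 25, 30)
fit <- kmeans(S, centers = c(4, 12))
fit$centers  # final means: 7 and 25, as in the example
fit$cluster  # cluster membership of each element of S
```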
K-means clustering
• Simple unsupervised machine learning algorithm
• Partitional clustering approach
• Each cluster is associated with a centroid or mean (center point)
• Each point is assigned to the cluster with the closest centroid.
• Number of clusters K must be specified.
K-means Algorithm:
Clustering packages in R
1. Cluster
2. ClusterR
3. NbClust
Function for k-means in R:
kmeans(x, centers, nstart)
where x → numeric dataset (matrix or data frame)
centers → number of clusters to extract
nstart → number of random initial configurations to try (the best one is kept)
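As a sketch of the signature above, applied to two iris columns (the choice of nstart = 25 here is illustrative, not prescribed by the slide):

```r
# nstart = 25 runs 25 random initializations and keeps the best solution.
set.seed(20)
fit <- kmeans(iris[, 3:4], centers = 3, nstart = 25)
fit$size     # how many points fell into each of the 3 clusters
fit$centers  # the 3 cluster centroids
```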
K-means-clustering in R
# Before clustering: explore the data
library(datasets)
head(iris)
library(ggplot2)
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
# K-means clustering on the two petal columns
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3)
irisCluster
# Plot the points colored by the cluster found
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()
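A natural follow-up check, not shown on the slide: since iris carries true Species labels, the clusters can be cross-tabulated against them to see how well they line up.

```r
# Cross-tabulate k-means clusters against the true species labels.
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3)
table(irisCluster$cluster, iris$Species)
```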
2. Classification
• Categorizes data into a distinct, predefined set of classes.
Example:
Classification Algorithms
• Decision Tree
• Bayes Classifier
• Nearest Neighbor
• Support Vector Machines
• Linear Classifiers (e.g., Logistic Regression)
1. Decision Tree
Example Decision Tree:
Example
KNN classification:
• Supervised learning algorithm
• Lazy learning algorithm.
• Based on similarity measure (distance function)
Steps for KNN:
1. Calculate distance (e.g. Euclidean distance, Hamming distance, etc.)
2. Find k closest neighbors
3. Vote for labels or calculate the mean
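The three steps above can be sketched by hand for a single query point, using two iris predictors and Euclidean distance (the query values and k = 5 here are made up for illustration; the knn() call on the next slide does this for a whole test set at once):

```r
train  <- iris[, c("Petal.Length", "Petal.Width")]
labels <- iris$Species
query  <- c(4.5, 1.5)
# Step 1: Euclidean distance from the query to every training point.
d <- sqrt((train$Petal.Length - query[1])^2 + (train$Petal.Width - query[2])^2)
# Step 2: indices of the k = 5 closest neighbors.
nn <- order(d)[1:5]
# Step 3: majority vote among the neighbors' labels.
names(which.max(table(labels[nn])))
</imports>
```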
Example:
KNN classification in R
data(iris)  ## load the data
head(iris)  ## see the structure
## Sample 90% of the row indices for the training set.
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
## Create the min-max normalization function.
nor <- function(x) { (x - min(x)) / (max(x) - min(x)) }
## Run normalization on the first 4 columns of the dataset, the predictors.
iris_norm <- as.data.frame(lapply(iris[, c(1, 2, 3, 4)], nor))
summary(iris_norm)
## Extract the training set.
iris_train <- iris_norm[ran, ]
## Extract the testing set.
iris_test <- iris_norm[-ran, ]
## Extract the 5th column for the training rows; it becomes the 'cl' argument of knn().
iris_target_category <- iris[ran, 5]
## Extract the 5th column for the test rows to measure accuracy.
iris_test_category <- iris[-ran, 5]
## Load the class package.
library(class)
## Run the knn function.
pr <- knn(iris_train, iris_test, cl = iris_target_category, k = 13)
## Create the confusion matrix.
tab <- table(pr, iris_test_category)
## Accuracy: correct predictions divided by the total number of predictions.
accuracy <- function(x) { sum(diag(x)) / sum(x) * 100 }
accuracy(tab)
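Since k = 13 was fixed by hand above, one quick sanity check (an addition, not part of the slide) is to re-run knn() for several k values and compare the hold-out accuracy; set.seed() makes the random split reproducible.

```r
library(class)
set.seed(1)
ran <- sample(1:nrow(iris), 0.9 * nrow(iris))
nor <- function(x) (x - min(x)) / (max(x) - min(x))
iris_norm <- as.data.frame(lapply(iris[, 1:4], nor))
# Try a few candidate k values and print the accuracy of each.
for (k in c(1, 5, 9, 13)) {
  pr  <- knn(iris_norm[ran, ], iris_norm[-ran, ], cl = iris[ran, 5], k = k)
  acc <- 100 * mean(pr == iris[-ran, 5])
  cat("k =", k, " accuracy =", acc, "%\n")
}
```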
3. Regression Analysis
1. Linear Regression:
• Assumes a linear relationship between the input variable (x) and the output variable (y).
• Fits a straight line to the data.
2. Multiple Linear Regression:
When there are multiple input variables, the statistics literature often refers to the
method as multiple linear regression.
Simple Linear Regression:
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
# Apply the lm() function.
relation <- lm(y~x)
print(relation)
summary(relation)
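A natural follow-up to the fit above, not shown on the slide: use predict() to estimate y for a new x value.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
relation <- lm(y ~ x)
# Predict y for a new observation with x = 170.
predict(relation, data.frame(x = 170))  # roughly 76.2
```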
Multiple Linear Regression:
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
print(head(input))
# Create the relationship model and get the coefficients
model <- lm(mpg ~ disp + hp + wt, data = input)
# Show the model.
print(model)
# Get the intercept and coefficients as vector elements.
cat("# # # # The Coefficient Values # # # ", "\n")
a <- coef(model)[1]
print(a)
Xdisp <- coef(model)[2]
Xhp <- coef(model)[3]
Xwt <- coef(model)[4]
print(Xdisp)
print(Xhp)
print(Xwt)
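To apply the fitted model (an added example; the disp, hp, and wt values below describe a hypothetical car, not a row of mtcars), the coefficients can be used via the equation mpg = a + Xdisp*disp + Xhp*hp + Xwt*wt, or more simply via predict():

```r
input <- mtcars[, c("mpg", "disp", "hp", "wt")]
model <- lm(mpg ~ disp + hp + wt, data = input)
# Predict mpg for a hypothetical car: disp = 221, hp = 102, wt = 2.91.
predict(model, data.frame(disp = 221, hp = 102, wt = 2.91))
```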