Data Preprocessing
Dataset
Country   Age   Salary   Purchased
France    44    72000    No
Spain     27    48000    Yes
Germany   30    54000    No
Spain     38    61000    No
Germany   40             Yes
France    35    58000    Yes
Spain           52000    No
France    48    79000    Yes
Germany   50    83000    No
France    37    67000    Yes
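To follow along without a Data.csv file on disk, the table above can be loaded from an in-memory string. This is only a sketch: the comma-separated layout and the name CSV_TEXT are assumptions, and the two blank cells are written as empty fields so that pandas reads them as NaN.

```python
import io

import pandas as pd

# The ten rows from the slide. Empty fields stand for the two missing values
# (Salary in row 4, Age in row 6, counting from 0).
CSV_TEXT = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
Germany,40,,Yes
France,35,58000,Yes
Spain,,52000,No
France,48,79000,Yes
Germany,50,83000,No
France,37,67000,Yes
"""

# io.StringIO lets read_csv treat the string like an open file.
dataset = pd.read_csv(io.StringIO(CSV_TEXT))

print(dataset.shape)  # (10, 4)
print(dataset)
```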
Python
Reading a file from a directory in Python
• from tkinter import *
• from tkinter.filedialog import askopenfilename
• root = Tk()
• root.withdraw()
• root.update()
• file_path = askopenfilename()
• root.destroy()
Importing the libraries
• import numpy as np
• import matplotlib.pyplot as plt
• import pandas as pd
Importing the dataset
• dataset = pd.read_csv('Data.csv')
• X = dataset.iloc[:, :-1].values
• y = dataset.iloc[:, 3].values
Missing data
• from sklearn.impute import SimpleImputer
• imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
• imputer = imputer.fit(X[:, 1:3])
• X[:, 1:3] = imputer.transform(X[:, 1:3])
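As a rough check of what mean imputation does to this dataset, the sketch below runs it on the Age and Salary columns alone. The array X_num is a hand-copied stand-in for X[:, 1:3], not something from the slides.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset slide; np.nan marks the gaps.
X_num = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imputer.fit_transform(X_num)

# statistics_ holds the per-column means learned during fit:
# 349 / 9 for Age and 574000 / 9 for Salary (the nine known values each).
print(imputer.statistics_)

# The two rows that had gaps now carry the column means.
print(X_filled[4], X_filled[6])
```

Each gap is replaced by the mean of its own column, computed from the nine known values, so no information leaks between Age and Salary.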
Encoding categorical data
• from sklearn.compose import ColumnTransformer
• from sklearn.preprocessing import LabelEncoder, OneHotEncoder
• ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
• X = np.array(ct.fit_transform(X), dtype = float)
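To see what one-hot encoding produces, here is a small sketch on three rows of the dataset. The reduced array X_small and the transformer name 'country' are illustrative choices. OneHotEncoder sorts categories alphabetically, so the dummy columns come out in the order France, Germany, Spain, followed by the untouched Age column.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three sample rows: Country (text) and Age (number) in one object array.
X_small = np.array([['France', 44.0],
                    ['Spain', 27.0],
                    ['Germany', 30.0]], dtype=object)

# One-hot encode column 0 and pass the other columns through unchanged.
# sparse_threshold=0 asks for a dense result regardless of how many zeros it has.
ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder='passthrough',
                       sparse_threshold=0)
X_encoded = np.asarray(ct.fit_transform(X_small), dtype=float)

print(X_encoded)
# [[ 1.  0.  0. 44.]
#  [ 0.  0.  1. 27.]
#  [ 0.  1.  0. 30.]]
```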
Encoding the Dependent Variable
• labelencoder_y = LabelEncoder()
• y = labelencoder_y.fit_transform(y)
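A brief sketch of what the label encoder does to the Purchased column. The list y_raw is copied from the dataset slide, and the variable names are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Purchased column from the dataset slide, in row order.
y_raw = ['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes']

labelencoder_y = LabelEncoder()
y_encoded = labelencoder_y.fit_transform(y_raw)

# Classes are sorted alphabetically, so 'No' becomes 0 and 'Yes' becomes 1.
print(list(labelencoder_y.classes_))  # ['No', 'Yes']
print(y_encoded.tolist())             # [0, 1, 0, 0, 1, 1, 0, 1, 0, 1]

# inverse_transform maps the integers back to the original labels.
print(list(labelencoder_y.inverse_transform([0, 1])))  # ['No', 'Yes']
```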
Splitting into Training set and Test set
• from sklearn.model_selection import train_test_split
• X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
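The sketch below illustrates the effect of test_size and random_state on a ten-row stand-in for X and y. The array contents are placeholders, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features: ten rows, two columns; the first column acts as a row id.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)

# With ten rows and test_size=0.2, eight rows go to training and two to testing.
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# Repeating the call with the same random_state reproduces the same split.
X_train2, X_test2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test2))  # True
```

Fixing random_state makes the split reproducible between runs, which is why the slides pin it to a constant.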
Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
NOTE : Apply feature scaling after splitting the data, for the following reasons:
• Split it, then scale. Imagine it this way: you have no idea what real-world data looks like, so you couldn't scale the training data to it. Your test data is the surrogate for real-world data, so you should treat it the same way.
• To reiterate: split, fit the scaler on your training data, then apply that same scaling to the test data.
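The note above can be checked numerically. In this sketch the small train and test arrays are made-up subsets of the Age and Salary columns. The scaler learns its mean and standard deviation from the training rows only, and the test row is transformed with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative split of a few Age/Salary rows (not the split from the slides).
X_train = np.array([[27., 48000.], [30., 54000.], [38., 61000.], [44., 72000.]])
X_test = np.array([[50., 83000.]])

sc_X = StandardScaler()
X_train_scaled = sc_X.fit_transform(X_train)  # learns mean/std from training data
X_test_scaled = sc_X.transform(X_test)        # reuses the training mean/std

# The same result by hand: (x - training mean) / training standard deviation.
manual = (X_test - X_train.mean(axis=0)) / X_train.std(axis=0)

print(X_train_scaled.mean(axis=0))         # approximately [0, 0]
print(np.allclose(X_test_scaled, manual))  # True
```

The scaled training columns have mean 0 and standard deviation 1, while the test row generally does not, which is expected: it is measured against the training distribution.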
Checking for NULL values
• dataset.isnull()
• dataset.isnull().sum()
• Note : dataset is a pandas DataFrame
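For illustration, here is what the two calls return on the sample dataset. The DataFrame is rebuilt by hand from the dataset slide, and rows_with_gaps is an extra step added for this sketch.

```python
import numpy as np
import pandas as pd

# The sample dataset, with np.nan where the slide shows a blank cell.
dataset = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany', 'Spain', 'Germany',
                'France', 'Spain', 'France', 'Germany', 'France'],
    'Age': [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    'Salary': [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    'Purchased': ['No', 'Yes', 'No', 'No', 'Yes',
                  'Yes', 'No', 'Yes', 'No', 'Yes'],
})

null_mask = dataset.isnull()          # True/False per cell
null_counts = dataset.isnull().sum()  # number of missing values per column
print(null_counts)

# Rows that contain at least one missing value.
rows_with_gaps = dataset[null_mask.any(axis=1)]
print(rows_with_gaps)
```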
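# Data Preprocessing Python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
# Taking care of missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Encoding categorical data (Country must be numeric before it can be scaled)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = np.array(ct.fit_transform(X), dtype = float)
# Encoding the Dependent Variable
y = LabelEncoder().fit_transform(y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling (fit on the training set only, then reuse on the test set)
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# y is a 0/1 class label here, so it is not scaled.
# For a continuous target, scale it as a column vector:
# sc_y = StandardScaler()
# y_train = sc_y.fit_transform(y_train.reshape(-1, 1))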
R
R : Importing the dataset
dataset = read.csv('Data.csv')
R : Missing data
• dataset$Age = ifelse(is.na(dataset$Age), ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)), dataset$Age)
• dataset$Salary = ifelse(is.na(dataset$Salary), ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)), dataset$Salary)
Encoding categorical data
• dataset$Country = factor(dataset$Country, levels = c('France', 'Spain', 'Germany'), labels = c(1, 2, 3))
• dataset$Purchased = factor(dataset$Purchased, levels = c('No', 'Yes'), labels = c(0, 1))
R : Splitting into Training set and Test set
• PACKAGES :
• install.packages('caTools')
• library(caTools)
• set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
R: Feature Scaling
training_set = scale(training_set)
test_set = scale(test_set)
NOTE : scale() accepts only numeric columns, so unlike the Python code above we cannot scale the whole data frame once Country and Purchased are factors. Apply feature scaling to the non-categorical columns only (Age and Salary, columns 2 and 3). So our code becomes :
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])
# Data Preprocessing R
# Importing the dataset
dataset = read.csv('Data.csv')
# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
                     ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                     dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
                        ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                        dataset$Salary)
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
# Feature Scaling (numeric columns only: Age and Salary)
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])

Data preprocessing for Machine Learning with R and Python
