Data preprocessing for Machine Learning with R and Python

Dataset
Country   Age      Salary   Purchased
France    44       72000    No
Spain     27       48000    Yes
Germany   30       54000    No
Spain     38       61000    No
Germany   40       missing  Yes
France    35       58000    Yes
Spain     missing  52000    No
France    48       79000    Yes
Germany   50       83000    No
France    37       67000    Yes
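
For readers who do not have Data.csv at hand, the table can be rebuilt directly as a pandas DataFrame. This is only a sketch: it assumes the incomplete Germany row lacks its Salary and the incomplete Spain row lacks its Age, and it marks both with np.nan.

```python
import numpy as np
import pandas as pd

# The ten rows of the dataset; np.nan marks the two missing cells
# (assumed here to be one Age and one Salary value).
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

print(dataset.shape)   # number of rows and columns
print(dataset.head())  # first five rows
```

Calling dataset.to_csv('Data.csv', index=False) on this frame would produce a file that pd.read_csv('Data.csv') can load.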

Choosing a file from a directory in Python (file dialog)
• from tkinter import Tk
• from tkinter.filedialog import askopenfilename
• root = Tk()
• root.withdraw()
• root.update()
• file_path = askopenfilename()
• root.destroy()

Importing the libraries
• import numpy as np
• import matplotlib.pyplot as plt
• import pandas as pd

Importing the dataset
• dataset = pd.read_csv('Data.csv')
• X = dataset.iloc[:, :-1].values
• y = dataset.iloc[:, 3].values

Missing data
• from sklearn.impute import SimpleImputer
• imputer = SimpleImputer(missing_values = np.nan,
strategy = 'mean')
• imputer = imputer.fit(X[:, 1:3])
• X[:, 1:3] = imputer.transform(X[:, 1:3])
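
To make the 'mean' strategy concrete, here is a minimal NumPy sketch of what the imputation amounts to for a single column. The helper name impute_mean is made up for this illustration, and the two arrays assume the missing cells are one Age and one Salary value.

```python
import numpy as np

# Age and Salary columns from the dataset; np.nan marks the missing cells.
ages = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37])
salaries = np.array([72000, 48000, 54000, 61000, np.nan,
                     58000, 52000, 79000, 83000, 67000])

def impute_mean(column):
    """Return a copy of column with NaN entries replaced by the mean of the observed values."""
    filled = column.copy()
    filled[np.isnan(filled)] = np.nanmean(column)  # nanmean ignores the NaN cells
    return filled

ages_filled = impute_mean(ages)
salaries_filled = impute_mean(salaries)

print(ages_filled[6])      # the imputed age: mean of the nine known ages
print(salaries_filled[4])  # the imputed salary: mean of the nine known salaries
```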

Encoding categorical data
• from sklearn.preprocessing import
LabelEncoder, OneHotEncoder
• from sklearn.compose import ColumnTransformer
• ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
remainder = 'passthrough', sparse_threshold = 0)
• X = ct.fit_transform(X)
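
One-hot encoding replaces the Country column with one 0/1 column per country, so that no artificial ordering (such as France < Germany < Spain) is implied by integer codes. The following NumPy sketch shows the idea by hand; it is an illustration of the technique, not what scikit-learn runs internally.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain", "Germany",
                      "France", "Spain", "France", "Germany", "France"])

# np.unique sorts the distinct categories and returns an integer code per row.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for code i.
one_hot = np.eye(len(categories))[codes]

print(categories)   # column order of the dummy variables
print(one_hot[:3])  # encoded form of the first three rows
```

Each row of one_hot contains exactly one 1, in the column belonging to that row's country.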

Encoding the Dependent Variable
• labelencoder_y = LabelEncoder()
• y = labelencoder_y.fit_transform(y)

Splitting into Training set and Test set
• from sklearn.model_selection import
train_test_split
• X_train, X_test, y_train, y_test =
train_test_split(X, y, test_size = 0.2,
random_state = 42)
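
Conceptually, the split shuffles the row indices and cuts them into two groups. The sketch below does this with NumPy; split_indices is a hypothetical helper written for this illustration, and it will not select the same rows as train_test_split for a given seed.

```python
import numpy as np

def split_indices(n_samples, test_size=0.2, seed=42):
    """Shuffle the row indices and cut them into (train, test) index arrays."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_test = int(np.ceil(n_samples * test_size))  # round the test share up
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # sizes of the two parts
```

With ten rows and test_size = 0.2, eight rows go to the training set and two to the test set; X[train_idx] and X[test_idx] would then pick out the corresponding rows.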

Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
NOTE : Apply feature scaling after splitting the data, for the
following reason:
• Split first, then scale. You do not know what real-world data
will look like, so you cannot use it when computing the scaling.
The test data stands in for real-world data, so it must be treated
the same way: it must not influence the mean and standard
deviation used for scaling.
• To reiterate: split, fit the scaler on the training data, then
apply that same scaling to the test data.
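
The rule can be written out by hand in NumPy. This sketch uses a few complete Age/Salary rows from the dataset with an arbitrary split chosen for illustration; the mean and standard deviation come from the training rows only and are then reused for the test rows.

```python
import numpy as np

# A few complete Age/Salary rows, split arbitrarily for illustration.
X_train = np.array([[44.0, 72000.0], [27.0, 48000.0],
                    [30.0, 54000.0], [38.0, 61000.0]])
X_test = np.array([[35.0, 58000.0], [48.0, 79000.0]])

# Statistics are computed on the training rows only.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)  # population standard deviation (ddof=0), as StandardScaler uses

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std  # same mean and std, never refit on test data

print(X_train_scaled.mean(axis=0))  # approximately 0 for each column
print(X_test_scaled)
```

The scaled test columns generally do not have mean 0 and standard deviation 1, which is expected: they are measured on the training set's scale.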

Checking for NULL values
• dataset.isnull()
• dataset.isnull().sum()
• Note : dataset is a dataframe
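
A short sketch of these two calls on the numeric columns of the dataset, assuming the two missing cells are one Age and one Salary value. isnull() returns a True/False frame of the same shape, and sum() counts the True values per column.

```python
import numpy as np
import pandas as pd

dataset = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

# Count of missing cells in each column.
missing_per_column = dataset.isnull().sum()

# Rows that contain at least one missing cell.
rows_with_missing = dataset[dataset.isnull().any(axis=1)]

print(missing_per_column)
print(rows_with_missing)
```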

# Data Preprocessing Python
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values
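# Taking care of missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
# Encoding categorical data (required before scaling, because Country and Purchased are text)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
ct = ColumnTransformer([('country', OneHotEncoder(), [0])], remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X)
y = LabelEncoder().fit_transform(y)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
# y holds class labels (0 = No, 1 = Yes), so it is not scaled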

R : Importing the dataset
dataset = read.csv('Data.csv')

R : Missing data
• dataset$Age = ifelse(is.na(dataset$Age),
ave(dataset$Age, FUN = function(x) mean(x,
na.rm = TRUE)),
dataset$Age)
• dataset$Salary = ifelse(is.na(dataset$Salary),
ave(dataset$Salary, FUN = function(x) mean(x,
na.rm = TRUE)),
dataset$Salary)

R : Encoding categorical data
• dataset$Country = factor(dataset$Country,
levels = c('France', 'Spain', 'Germany'),
labels = c(1, 2, 3))
• dataset$Purchased = factor(dataset$Purchased,
levels = c('No', 'Yes'),
labels = c(0, 1))

R : Splitting into Training set and Test set
• PACKAGES :
• install.packages('caTools')
• library(caTools)
• set.seed(123)
• split = sample.split(dataset$Purchased,
SplitRatio = 0.8)
• training_set = subset(dataset, split == TRUE)
• test_set = subset(dataset, split == FALSE)
• Note : Purchased is the dependent variable of this dataset

R: Feature Scaling
training_set = scale(training_set)
test_set = scale(test_set)
NOTE : Unlike in Python, we cannot apply feature scaling to
categorical data in R, because factor columns are not numeric
and scale() fails on them. We have to apply feature scaling only
to the non-categorical features (Age and Salary, columns 2 and 3).
So our code becomes :
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])

# Data Preprocessing R
# Importing the dataset
dataset = read.csv('Data.csv')
# Taking care of missing data
dataset$Age = ifelse(is.na(dataset$Age),
ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
dataset$Age)
dataset$Salary = ifelse(is.na(dataset$Salary),
ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
dataset$Salary)
# Splitting the dataset into the Training set and Test set
# install.packages('caTools')
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
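# Feature Scaling (numeric columns only; scale() fails on the non-numeric Country and Purchased columns)
training_set[, 2:3] = scale(training_set[, 2:3])
test_set[, 2:3] = scale(test_set[, 2:3])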
