EDA tools and making sense of data.pdf

Agenda
• Significance of Exploratory Data Analysis
• Making sense of data

Steps followed in Handling Data
• Importing the libraries
• Importing the Dataset
• Handling of Missing Data
• Handling of Categorical Data
• Data Visualization
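
The missing-data step can be sketched as follows. This is a minimal illustration, assuming mean imputation for numeric columns; the toy DataFrame and its column names are hypothetical and not taken from the original dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names are illustrative only.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, np.nan, 54000.0, 61000.0],
})

# Count the missing values in each column.
missing_counts = df.isnull().sum()
print(missing_counts)

# Replace missing numeric values with the mean of their column
# (an assumed strategy; median or a constant are common alternatives).
numeric_cols = ["Age", "Salary"]
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].mean())
print(df)
```

Mean imputation keeps every row but pulls the filled values toward the centre of the distribution, so it is worth checking how many values were filled before relying on the result.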

One Hot Encoder
• one_hot_encoded_data = pd.get_dummies(data, columns=['Remarks', 'Gender'])
• print(one_hot_encoded_data)
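
To see what this call produces, here is a small self-contained sketch. The three-row DataFrame is a hypothetical stand-in for the employee data, which is not reproduced in these slides.

```python
import pandas as pd

# Hypothetical stand-in for the employee data.
data = pd.DataFrame({
    "Employee_Id": [1, 2, 3],
    "Gender": ["M", "F", "F"],
    "Remarks": ["Good", "Nice", "Good"],
})

# Each listed column is replaced by one indicator column per category.
one_hot_encoded_data = pd.get_dummies(data, columns=["Remarks", "Gender"])
print(one_hot_encoded_data.columns.tolist())
print(one_hot_encoded_data)
```

The new columns are named after the original column and the category (for example `Gender_F`), and the original `Gender` and `Remarks` columns are dropped.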

Handling Categorical data
• # importing libraries
• import pandas as pd
• import numpy as np
• from sklearn.preprocessing import OneHotEncoder
• # Retrieving data
• data = pd.read_csv('Employee_data.csv')
• # Converting type of columns to category
• data['Gender'] = data['Gender'].astype('category')
• data['Remarks'] = data['Remarks'].astype('category')

• # Assigning numerical codes and storing them in new columns
• data['Gen_new'] = data['Gender'].cat.codes
• data['Rem_new'] = data['Remarks'].cat.codes
• # Create an instance of One-hot-encoder
• enc = OneHotEncoder()
• # Passing encoded columns
• enc_data = pd.DataFrame(enc.fit_transform(
      data[['Gen_new', 'Rem_new']]).toarray())
• # Merge with main
• New_df = data.join(enc_data)
• print(New_df)

(on the output variable, purchased: y)

Encoding the categorical data
• The dataset has two categorical variables: country and purchased.
# Categorical data
# for the Country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
• The LabelEncoder class has encoded the country names into integers.
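
The snippet above assumes a feature array `x` whose first column holds the country. A runnable sketch with a hypothetical three-row array (the country names are illustrative assumptions) might look like this:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical feature array: country, age, salary.
x = np.array([
    ["Spain", 38.0, 68000.0],
    ["France", 43.0, 45000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

label_encoder_x = LabelEncoder()
# Replace the country names in column 0 with integer codes.
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

# classes_ lists the original labels; the position of each label is its code.
print(label_encoder_x.classes_)
print(x)
```

LabelEncoder assigns codes in sorted label order, so the integers carry an ordering that the countries themselves do not have. For nominal features like this, one-hot encoding is often preferred.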

Encode the dependent variable
• For the second categorical variable,
purchased or not purchased,
you can use a “labelencoder”
object of the LabelEncoder class.
• The OneHotEncoder class is not needed
here: the purchased variable has only
two categories, yes and no, which are
encoded into 0 and 1.

Output
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],

Label encoding the dependent variable
• labelencoder_y = LabelEncoder()
• y = labelencoder_y.fit_transform(y)
• The output will be:
• Out : array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

Exploratory Data Analysis (EDA)
• A method of studying and exploring data sets to
understand their main characteristics, discover
patterns, locate outliers, and identify
relationships between variables.
• EDA is normally carried out as a preliminary step
before modelling.

Purpose of using EDA tools
• Data Visualization
• Correlation and Relationships
• Feature Engineering
• Data Segmentation
• Time Series Analysis
• Missing Data Analysis
• Outlier Analysis

EDA
• An approach to analyzing data sets in order to summarize their statistical
characteristics,
• using statistical graphics and other data visualization methods.
• A critical process of performing initial investigations on data so as to
discover patterns, spot anomalies (anomaly detection), test
hypotheses and check assumptions with the help of summary
statistics and graphical representations.
• Understand the data first and try to gather as many insights from it as possible.
• In short: making sense of data.
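
A first pass at summary statistics could look like the sketch below. The small DataFrame is a hypothetical stand-in with assumed column names; only standard pandas calls are used.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for an employee dataset.
df = pd.DataFrame({
    "Team": ["Marketing", "Finance", "Finance", "Sales"],
    "Salary": [97308.0, 61933.0, np.nan, 138705.0],
})

shape = df.shape                         # (rows, columns)
dtypes = df.dtypes                       # data type of each column
missing = df.isnull().sum()              # missing values per column
summary = df.describe()                  # count, mean, std, min, quartiles, max
team_counts = df["Team"].value_counts()  # frequency of each category

print(shape)
print(dtypes)
print(missing)
print(summary)
print(team_counts)
```

These few calls usually answer the first questions about a dataset: how large it is, which columns are numeric, where values are missing, and how the categories are distributed.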

Read Data set
• import pandas as pd
• import numpy as np
• # read dataset using pandas
• df = pd.read_csv('employees.csv')
• df.head()

Histogram
• # importing packages
• import seaborn as sns
• import matplotlib.pyplot as plt
• sns.histplot(x='Salary', data=df)
• plt.show()

Box Plot
• A box plot shows the distribution of data based
on the five-number summary:
• Minimum
• First quartile
• Median
• Third quartile
• Maximum.
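
The five-number summary can also be computed directly. The sketch below uses `numpy.percentile` on a small made-up array; the values are illustrative only.

```python
import numpy as np

# Illustrative values, not taken from the original dataset.
salaries = np.array([30, 35, 40, 45, 50, 55, 60, 65, 70], dtype=float)

# The 0th, 25th, 50th, 75th and 100th percentiles give the five-number summary.
minimum, q1, median, q3, maximum = np.percentile(salaries, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```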

Boxplot
• # importing packages
• import seaborn as sns
• import matplotlib.pyplot as plt
• sns.boxplot(x='Salary', y='Team', data=df)
• plt.show()

Box plots to visualize outliers
• A box plot is one of the many ways to visualize
data distribution.
• It can be drawn using matplotlib or seaborn.
• It plots q1 (25th percentile), q2
(50th percentile or median) and q3
(75th percentile) of the data. The whiskers
extend to the most extreme data points that lie within
the limits (q1 - 1.5*(q3 - q1)) and (q3 + 1.5*(q3 - q1)).
• Outliers are the points drawn beyond
the whiskers.
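
The same 1.5 * IQR rule can be applied numerically to list the outliers. This is a minimal sketch on made-up values, with one deliberately extreme point.

```python
import numpy as np

# Illustrative values; 150 is deliberately extreme.
values = np.array([30, 35, 40, 45, 50, 55, 60, 65, 150], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1                      # interquartile range

lower_limit = q1 - 1.5 * iqr       # lower whisker limit
upper_limit = q3 + 1.5 * iqr       # upper whisker limit

# Points outside the limits are flagged as outliers.
outliers = values[(values < lower_limit) | (values > upper_limit)]
print(lower_limit, upper_limit, outliers)
```

A flagged point is only a candidate: whether it is an error or a genuine rare event still has to be judged from the data's context.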

Anomaly Detection – outliers with Boxplot
• Anomalous data can be linked to some sort of
problem or rare event such as hacking,
bank fraud, malfunctioning equipment,
structural defects / infrastructure
failures, or textual errors.
• Outlier detection is the identification of
unexpected events, observations, or
items that differ significantly from the
norm.
• When applied to unlabelled data, it is called
unsupervised anomaly detection.

• The pandas “.corr()” function computes
the correlation matrix, which can be
visualized as a heatmap in seaborn.
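
A minimal sketch of this step follows. The synthetic DataFrame, its column names, and the `coolwarm` colour map are assumptions made for illustration; `DataFrame.corr()` and `seaborn.heatmap()` are the only library calls the step depends on.

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
age = rng.integers(20, 60, size=50)
df = pd.DataFrame({
    "Age": age,
    "Experience": age - 20,   # perfectly correlated with Age by construction
    "Salary": rng.normal(60000, 10000, size=50),
})

# Pairwise Pearson correlation between the numeric columns.
corr = df.corr()

# annot=True writes each coefficient into its cell; vmin/vmax fix the colour scale.
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.show()
```

Which shade means positive or negative depends on the colour map, so fixing `vmin` and `vmax` and showing the colour bar makes the plot easier to read.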

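• In this heatmap, dark shades represent positive correlation
while lighter shades represent negative
correlation (the mapping depends on the colour map used).
• It is common practice to consider removing variables with
near-zero correlation during feature selection.
• A correlation of zero means there is no linear relationship
between the two variables; a non-linear relationship may still exist.
• Such features are often safe to drop, provided no non-linear effect is found.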

EDA tools
• Python libraries (pandas, numpy, matplotlib and
seaborn)
• Typical graphical
techniques used in EDA are:
• Box plot
• Histogram
• Scatter plot
