Agenda
• Significance of Exploratory Data Analysis
• Making sense of data
EDA tools and making sense of data.pdf
Steps followed in Handling Data
• Importing the libraries
• Importing the Dataset
• Handling of Missing Data
• Handling of Categorical Data
• Data Visualization
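The steps above list missing-data handling, but the slides show no code for it. The following is a minimal sketch, assuming a small made-up table with hypothetical `Age` and `Salary` columns and assuming mean imputation as the strategy; other strategies may suit other data better.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; the column names are illustrative only
df = pd.DataFrame({
    "Age": [38.0, 43.0, np.nan, 48.0],
    "Salary": [68000.0, np.nan, 54000.0, 65000.0],
})

# Count the missing values in each column
missing_counts = df.isna().sum()

# Replace each missing value with the mean of its own column
df_filled = df.fillna(df.mean())

print(missing_counts)
print(df_filled)
```

Mean imputation keeps every row but pulls the filled values toward the centre of the column, so it is worth checking the distribution again afterwards.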
Handling Categorical data
One Hot Encoder
• one_hot_encoded_data = pd.get_dummies(data, columns=['Remarks', 'Gender'])
• print(one_hot_encoded_data)
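To see what `pd.get_dummies` produces, here is a small self-contained sketch. The three-row table is a hypothetical stand-in for the employee data, which is not included in the slides; only the `Remarks` and `Gender` column names come from the slide.

```python
import pandas as pd

# Hypothetical stand-in for the employee data; the values are made up
data = pd.DataFrame({
    "Employee_Id": [1, 2, 3],
    "Gender": ["M", "F", "F"],
    "Remarks": ["Good", "Nice", "Good"],
})

# Each listed column is replaced by one indicator column per category
one_hot_encoded_data = pd.get_dummies(data, columns=["Remarks", "Gender"])

print(one_hot_encoded_data.columns.tolist())
```

Columns that are not listed in `columns` (here `Employee_Id`) are passed through unchanged.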
Handling Categorical data
• # importing libraries
• import pandas as pd
• import numpy as np
• from sklearn.preprocessing import OneHotEncoder
• # Retrieving data
• data = pd.read_csv('Employee_data.csv')
• # Converting type of columns to category
• data['Gender'] = data['Gender'].astype('category')
• data['Remarks'] = data['Remarks'].astype('category')
Handling Categorical data
• # Assigning numerical codes and storing them in new columns
• data['Gen_new'] = data['Gender'].cat.codes
• data['Rem_new'] = data['Remarks'].cat.codes
• # Create an instance of One-hot-encoder
• enc = OneHotEncoder()
• # Passing encoded columns
• enc_data = pd.DataFrame(enc.fit_transform(data[['Gen_new', 'Rem_new']]).toarray())
• # Merge with main
• New_df = data.join(enc_data)
• print(New_df)
Output of One hot encoder
Handling Categorical data
(on the output variable y, purchased)
Encoding the categorical data
• The dataset has two categorical variables: country and purchased.
# Categorical data
# for the country variable
from sklearn.preprocessing import LabelEncoder
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])
• The LabelEncoder class has now encoded the country values as integers.
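The snippet above assumes a feature matrix `x` that the slides never define. A minimal runnable sketch, assuming `x` is an object-dtype NumPy array whose first column holds country names (the rows below are made up for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical feature matrix: country, age, salary
x = np.array([
    ["Spain", 38.0, 68000.0],
    ["France", 43.0, 45000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

# Encode the country column in place as integer codes
label_encoder_x = LabelEncoder()
x[:, 0] = label_encoder_x.fit_transform(x[:, 0])

# classes_ lists the original labels in code order (code 0 first)
print(label_encoder_x.classes_)
print(x[:, 0])
```

The codes follow the sorted order of the labels, so they imply an ordering that countries do not have. scikit-learn documents `LabelEncoder` as intended for target labels; for input features, one-hot encoding is usually the safer choice.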
Encode the dependent variable
• For the second categorical variable, purchased (yes or no), use a “labelencoder” object of the LabelEncoder class.
• The OneHotEncoder class is not needed here: the purchased variable has only two categories, yes and no, which are encoded as 0 and 1.
Output
array([[2, 38.0, 68000.0],
[0, 43.0, 45000.0],
[1, 30.0, 54000.0],
[0, 48.0, 65000.0],
[1, 40.0, 65222.22222222222],
[2, 35.0, 58000.0],
[1, 41.111111111111114, 53000.0],
[0, 49.0, 79000.0],
[2, 50.0, 88000.0],
Label encoder for the dependent variable
• labelencoder_y= LabelEncoder()
• y= labelencoder_y.fit_transform(y)
• The output will be:
• Out: array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
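The label-encoding step for `y` can be tried in isolation. This sketch assumes `y` is an array of "Yes"/"No" strings; the five values are made up and are not the data behind the output shown above.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical target column: purchased or not
y = np.array(["No", "Yes", "No", "No", "Yes"])

# Map the two labels to 0 and 1 (sorted order: "No" -> 0, "Yes" -> 1)
labelencoder_y = LabelEncoder()
y_encoded = labelencoder_y.fit_transform(y)

# inverse_transform recovers the original labels from the codes
y_original = labelencoder_y.inverse_transform(y_encoded)

print(y_encoded)
print(y_original)
```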
Exploratory Data Analysis (EDA)
• A method of studying and exploring data sets to understand their main characteristics, discover patterns, locate outliers, and identify relationships between variables.
• EDA is normally carried out as a preliminary step before modelling.
Purpose of using EDA tools
• Data Visualization
• Correlation and Relationships
• Feature Engineering
• Data Segmentation
• Time Series Analysis
• Missing Data Analysis
• Outlier Analysis
EDA
• An approach to analyzing data sets that summarizes their statistical characteristics, using statistical graphics and other data visualization methods.
• A critical process of performing initial investigations on data to discover patterns, spot anomalies (anomaly detection), test hypotheses, and check assumptions with the help of summary statistics and graphical representations.
• Understand the data first and try to gather as many insights from it as possible.
• In short: making sense of data.
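The summary statistics mentioned above can be obtained with a few pandas calls. A minimal sketch, assuming a small made-up table with hypothetical `Team` and `Salary` columns in place of a real dataset:

```python
import pandas as pd

# Hypothetical employee table; the values are made up
df = pd.DataFrame({
    "Team": ["Marketing", "Finance", "Finance", "Sales"],
    "Salary": [50000, 60000, 70000, 80000],
})

# Count, mean, standard deviation, min, quartiles and max of a numeric column
summary = df["Salary"].describe()

# Frequency of each category in a categorical column
team_counts = df["Team"].value_counts()

print(summary)
print(team_counts)
```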
Read Data set
• import pandas as pd
• import numpy as np
• # read dataset using pandas
• df = pd.read_csv('employees.csv')
• df.head()
Histogram
• # importing packages
• import seaborn as sns
• import matplotlib.pyplot as plt
• sns.histplot(x='Salary', data=df)
• plt.show()
Box Plot
• A box plot shows the distribution of data based on the five-number summary:
• Minimum
• First quartile
• Median
• Third quartile
• Maximum.
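The five-number summary can be computed directly before plotting. A small sketch, assuming nine made-up salary values (in thousands) and NumPy's default linear interpolation for percentiles:

```python
import numpy as np

# Hypothetical salaries in thousands; the values are made up
salaries = np.array([40, 45, 50, 55, 60, 65, 70, 75, 80])

# Minimum, first quartile, median, third quartile, maximum
minimum, q1, median, q3, maximum = np.percentile(salaries, [0, 25, 50, 75, 100])

print(minimum, q1, median, q3, maximum)
```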
Boxplot
• # importing packages
• import seaborn as sns
• import matplotlib.pyplot as plt
• sns.boxplot(x='Salary', y='Team', data=df)
• plt.show()
Box plots to visualize outliers
• One of the many ways to visualize a data distribution.
• Can be drawn using matplotlib or seaborn.
• Plots q1 (25th percentile), q2 (50th percentile, or median) and q3 (75th percentile) of the data, with whiskers bounded by q1 - 1.5*(q3 - q1) and q3 + 1.5*(q3 - q1).
• Outliers are the points that fall beyond the whiskers.
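The same 1.5 × IQR rule that a box plot draws can be applied numerically to list the outliers. A minimal sketch, assuming a made-up salary series (in thousands) with one extreme value:

```python
import pandas as pd

# Hypothetical salaries in thousands; 200 is deliberately extreme
salary = pd.Series([40, 45, 50, 55, 60, 65, 70, 75, 200])

# Quartiles and interquartile range
q1 = salary.quantile(0.25)
q3 = salary.quantile(0.75)
iqr = q3 - q1

# Fences beyond which a point is treated as an outlier
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = salary[(salary < lower_fence) | (salary > upper_fence)]
print(lower_fence, upper_fence)
print(outliers.tolist())
```

The factor 1.5 is a convention rather than a law; a flagged point still needs a look at the underlying record before it is removed.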
Anomaly Detection – outliers with Boxplot
• Anomalous data is often linked to some sort of problem or rare event, such as hacking, bank fraud, malfunctioning equipment, structural defects or infrastructure failures, or textual errors.
• Outlier detection is the identification of unexpected events, observations, or items that differ significantly from the norm.
• When applied to unlabelled data, it is known as unsupervised anomaly detection.
• The pandas “.corr()” function computes the correlation matrix, which can be visualized as a heatmap in seaborn.
• In this heatmap, dark shades represent positive correlation while lighter shades represent negative correlation.
• It is common practice to consider removing variables with zero correlation during feature selection.
• A correlation of zero means there is no linear relationship between the two variables.
• Such features are candidates for dropping, although a nonlinear relationship can still exist.
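A correlation heatmap of this kind could be produced roughly as follows. This is a sketch under assumptions: the three numeric columns are made up, the "coolwarm" colormap is an arbitrary choice, and the Agg backend is selected only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric columns; the values are made up
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45],
    "Salary": [30000, 40000, 50000, 60000, 70000],
    "Absences": [10, 8, 6, 4, 2],
})

# Pairwise Pearson correlation between the numeric columns
corr = df.corr()

# Draw the matrix as a heatmap, fixing the colour scale to [-1, 1]
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.close()  # in an interactive session, call plt.show() instead

print(corr)
```

Which shade means positive or negative depends entirely on the colormap, so it helps to fix `vmin` and `vmax` and read the colour bar rather than rely on darkness alone.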
EDA tools
• Python libraries: pandas, numpy, matplotlib and seaborn
• Typical graphical techniques used in EDA are:
• Box plot
• Histogram
• Scatter plot

