Data Analysis with Pandas
Prof. Rahul Borate
G H Raisoni College of Engineering and Management , Pune
Agenda
• DataFrame basics
• Importing and exporting data (CSV, Excel, JSON)
• Data cleaning: handling missing values and duplicates
• Data wrangling: filtering, grouping, and merging datasets
Why pandas?
• One of the most popular libraries used by data scientists
• Labeled axes help avoid misalignment of data
• Missing or special values may need to be removed or
replaced
       height  weight  weight2  age  gender
Amy       160     125      126   32       2
Bob       170     167      155   -1       1
Chris     168     143      150   28       1
David     190     182       NA   42       1
Ella      175     133      138   23       2
Frank     172     150      148   45       1

       salary  credit_score
Alice   50000           700
Bob        NA           670
Chris   60000            NA
David  -99999           750
Ella    70000           685
Tom     45000           660
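As a sketch of what labeled axes buy you, the second table above can be rebuilt as a DataFrame and its -99999 sentinel treated as a missing value (the numbers come from the slide):

```python
import numpy as np
import pandas as pd

# Rebuild the salary table from the slide as a labeled DataFrame
df = pd.DataFrame(
    {"salary": [50000, np.nan, 60000, -99999, 70000, 45000],
     "credit_score": [700, 670, np.nan, 750, 685, 660]},
    index=["Alice", "Bob", "Chris", "David", "Ella", "Tom"],
)

# Treat the -99999 sentinel as a missing value
df = df.replace(-99999, np.nan)
print(df.isna().sum())  # salary: 2 missing, credit_score: 1 missing
```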
Overview
• Created by Wes McKinney in 2008; now maintained by
many others.
• A powerful, productive Python library for data analysis and
management.
• It is an open-source project.
Overview
• A Python library providing data analysis features similar to
R, MATLAB, and SAS
• Rich data structures and functions that make working with
structured data fast, easy, and expressive
• Built on top of NumPy
• Key component provided by pandas:
• DataFrame
pandas.DataFrame
pandas.DataFrame(data, index, columns, dtype, copy)
• data
• Takes various forms: ndarray, Series, map, list, dict, constants, or another
DataFrame.
• index
• Labels for the rows. Optional; defaults to np.arange(n) if no index is
passed.
• columns
• Labels for the columns. Optional; defaults to np.arange(n) if no columns
argument is passed.
• dtype
• Data type of each column.
• copy
• Whether to copy data from the inputs; defaults to False.
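A small sketch of the different forms the data argument accepts (the column and index names here are illustrative):

```python
import numpy as np
import pandas as pd

# data as a dict of lists; index supplied explicitly
d1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# data as an ndarray; dtype forces a common column type
d2 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"], dtype=float)

# data as a list of dicts; missing keys become NaN
d3 = pd.DataFrame([{"x": 1}, {"x": 2, "y": 3}])

print(d1.index.tolist())   # ['r1', 'r2'] — overrides the default np.arange(n)
print(str(d2.dtypes["x"])) # float64
```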
DataFrame Basics
• A DataFrame is a tabular data structure composed of rows and
columns, similar to a spreadsheet or database table.
• It can be treated as an ordered collection of columns
• Each column can have a different data type
• It has both row and column indices
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
#output
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
DataFrame – specifying columns and indices
• Order of columns/rows can be specified.
• Columns not in data will have NaN.
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
D 2001 Nevada 2.4 NaN
E 2002 Nevada 2.9 NaN
Note: the columns appear in the specified order, and the 'debt'
column, absent from the data, is initialized with NaN.
DataFrame – index, columns, values
# frame3 can be built from a nested dict (outer keys become columns)
frame3 = DataFrame({'Nevada': {2001: 2.9, 2002: 2.9},
                    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}})
frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)
state Nevada Ohio
year
2000 NaN 1.5
2001 2.9 1.7
2002 2.9 3.6
print(frame3.index)
Int64Index([2000, 2001, 2002], dtype='int64', name='year')
print(frame3.columns)
Index(['Nevada', 'Ohio'], dtype='object', name='state')
print(frame3.values)
[[nan 1.5]
[2.9 1.7]
[2.9 3.6]]
DataFrame – retrieving a column
• A column in a DataFrame can be retrieved as a Series by
dict-like notation or as attribute
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame['state'])
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
Name: state, dtype: object
print(frame.state)
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
Name: state, dtype: object
Activity 1
• Download the following CSV file and load it in your Python
script, or pass the URL directly to pd.read_csv(url), which
reads it into a DataFrame
• https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/values.csv
• Calculate the average and standard deviation (std) of the
column factor_1 and display the results.
• Use pandas mean() and std()
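One way the activity could be approached; the read_csv(url) call is shown commented out, and a tiny inline CSV with hypothetical values stands in for the real file:

```python
import io
import pandas as pd

# In the activity you would read the CSV straight from the URL:
# df = pd.read_csv("https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/values.csv")
# Here a tiny inline CSV (hypothetical values) stands in for it.
csv_text = "factor_1,factor_2\n1.0,5\n2.0,6\n3.0,7\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df["factor_1"].mean())  # 2.0
print(df["factor_1"].std())   # 1.0 (sample std, ddof=1 by default)
```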
DataFrame – getting rows
• loc selects by label; iloc selects by integer position
• loc gets rows (or columns) with particular labels from the index.
• iloc gets rows (or columns) at particular integer positions in the index (so it takes only integers).
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
D 2001 Nevada 2.4 NaN
E 2002 Nevada 2.9 NaN
print(frame2.loc['A'])
year 2000
state Ohio
pop 1.5
debt NaN
Name: A, dtype: object
print(frame2.loc[['A', 'B']])
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
print(frame2.loc['A':'E',
['state','pop']])
state pop
A Ohio 1.5
B Ohio 1.7
C Ohio 3.6
D Nevada 2.4
E Nevada 2.9
print(frame2.iloc[1:3])
year state pop debt
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
print(frame2.iloc[:,1:3])
state pop
A Ohio 1.5
B Ohio 1.7
C Ohio 3.6
D Nevada 2.4
E Nevada 2.9
DataFrame – modifying columns
frame2['debt'] = 0
print(frame2)
year state pop debt
A 2000 Ohio 1.5 0
B 2001 Ohio 1.7 0
C 2002 Ohio 3.6 0
D 2001 Nevada 2.4 0
E 2002 Nevada 2.9 0
frame2['debt'] = range(5)
print(frame2)
year state pop debt
A 2000 Ohio 1.5 0
B 2001 Ohio 1.7 1
C 2002 Ohio 3.6 2
D 2001 Nevada 2.4 3
E 2002 Nevada 2.9 4
val = Series([10, 10, 10], index = ['A', 'C', 'D'])
frame2['debt'] = val
print(frame2)
year state pop debt
A 2000 Ohio 1.5 10.0
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 10.0
D 2001 Nevada 2.4 10.0
E 2002 Nevada 2.9 NaN
Rows or individual elements can be
modified similarly, using loc or iloc.
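A brief sketch of that, reusing the frame2 built above; the assigned values are illustrative:

```python
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['A', 'B', 'C', 'D', 'E'])

# Modify a single element by label, and by position
frame2.loc['B', 'debt'] = 99.0
frame2.iloc[0, 3] = 11.0          # row 0 ('A'), column 3 ('debt')

# Replace an entire row by label
frame2.loc['E'] = [2003, 'Nevada', 3.0, 0.0]
print(frame2.loc['E', 'pop'])  # 3.0
```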
DataFrame – removing columns
del frame2['debt']
print(frame2)
year state pop
A 2000 Ohio 1.5
B 2001 Ohio 1.7
C 2002 Ohio 3.6
D 2001 Nevada 2.4
E 2002 Nevada 2.9
Removing rows/columns
print(frame)
c1 c2 c3
r1 0 1 2
r2 3 4 5
r3 6 7 8
print(frame.drop(['r1']))
c1 c2 c3
r2 3 4 5
r3 6 7 8
print(frame.drop(['r1','r3']))
c1 c2 c3
r2 3 4 5
print(frame.drop(['c1'], axis=1))
c2 c3
r1 1 2
r2 4 5
r3 7 8
This returns a new object
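A short sketch showing that drop leaves the original untouched unless you reassign (the frame matches the one above):

```python
import pandas as pd

frame = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                     columns=['c1', 'c2', 'c3'], index=['r1', 'r2', 'r3'])

# drop returns a new DataFrame; the original is untouched
smaller = frame.drop(['r1'])
print('r1' in frame.index)    # True  — frame still has r1
print('r1' in smaller.index)  # False

# To change frame itself, reassign (or pass inplace=True)
frame = frame.drop(['c1'], axis=1)
print(frame.columns.tolist())  # ['c2', 'c3']
```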
Handling missing data
•Why Fill in the Missing Data?
It is necessary to fill in missing values in datasets, as most
machine learning models will raise an error if you pass NaN
values into them.
The easiest approach is to fill them with 0, but it is essential
to note that this can significantly reduce your model's
accuracy.
Handling missing data
•How to Know If the Data Has Missing Values?
Missing values are usually represented as NaN, null, or None
in the dataset.
• df.info() can be used to get information about the dataset,
including insight into missing data.
• It is one of the most used functions in data analysis: it shows
the column names and the number of non-null values in each
column.
• It also displays the data type of each column. From the
non-null counts we can see which columns contain nulls, and
from the data types we can decide what value to replace the
nulls with.
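A minimal sketch of df.info() on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing entry per column
df = pd.DataFrame({"Age": [22, np.nan, 30], "City": ["Pune", "Mumbai", None]})
df.info()
# The report lists 2 non-null values (out of 3 entries) for both
# Age and City, plus each column's dtype.
```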
Handling missing data
•How to Know If the Data Has Missing Values?
• The second way to find whether the data contains null values
is the isnull() function:
• print(df.isnull().sum())
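For example, on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one null in each column
df = pd.DataFrame({"Age": [22, np.nan, 30], "City": ["Pune", None, "Delhi"]})
print(df.isnull().sum())
# Age     1
# City    1
# dtype: int64
```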
Handling missing data
•Different Methods of Dealing With Missing Data
1. Deleting the column with missing data
updated_df = df.dropna(axis=1)
updated_df.info()
2. Deleting the row with missing data
If a row contains missing data, you can delete the entire row
with all the features in that row.
axis=1 drops columns with NaN values.
axis=0 drops rows with NaN values.
updated_df = newdf.dropna(axis=0)
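A quick sketch of both axes on a tiny hypothetical frame:

```python
import numpy as np
import pandas as pd

newdf = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# axis=1 drops any column containing NaN; axis=0 drops any row containing NaN
print(newdf.dropna(axis=1).columns.tolist())  # ['b']
print(len(newdf.dropna(axis=0)))              # 2 rows survive
```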
Handling missing data
•Different Methods of Dealing With Missing Data
3. Filling the Missing Values – Imputation
The possible ways to do this are:
• Fill the missing data with the mean or median if it is a
numerical variable.
• Fill the missing data with the mode if it is a categorical
variable.
• Fill the numerical value with 0, -999, or some other number
that does not occur in the data, so the model can recognize
that the value is not real.
• Fill the categorical value with a new category reserved for
missing values.
You can use the fillna() function to fill the null values in the dataset:
updated_df = df
updated_df['Age'] = updated_df['Age'].fillna(updated_df['Age'].mean())
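A sketch of mean and mode imputation together, on a small hypothetical frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, np.nan, 35, np.nan],
                   "Gender": ["F", "M", None, "M"]})

# Numerical column: fill with the mean (here (25 + 35) / 2 = 30)
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Categorical column: fill with the mode (NaN is ignored when computing it)
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

print(df["Age"].tolist())     # [25.0, 30.0, 35.0, 30.0]
print(df["Gender"].tolist())  # ['F', 'M', 'M', 'M']
```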
Editor's Notes
• #12: obj = pd.read_csv('values.csv')
• #15: df = frame.drop('r1', axis=0)  # axis=0 drops rows, axis=1 drops columns
• #17–#21: s.dropna(inplace=True)