Data Analysis with Pandas
Prof. Rahul Borate
G H Raisoni College of Engineering and Management , Pune
Agenda
• DataFrame basics
• Importing and exporting data (CSV, Excel, JSON)
• Data cleaning: handling missing values and duplicates
• Data wrangling: filtering, grouping, and merging datasets
Why pandas?
• One of the most popular libraries used by data scientists
• Labeled axes help avoid misalignment of data
• Missing or special values may need to be removed or
replaced
       height  weight  weight2  age  gender
Amy       160     125      126   32       2
Bob       170     167      155   -1       1
Chris     168     143      150   28       1
David     190     182       NA   42       1
Ella      175     133      138   23       2
Frank     172     150      148   45       1

       salary  credit_score
Alice   50000           700
Bob        NA           670
Chris   60000            NA
David  -99999           750
Ella    70000           685
Tom     45000           660
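As a sketch of what labeled axes buy you, the second table above can be rebuilt as a DataFrame and its -99999 sentinel treated as a missing value (the numbers come from the slide):

```python
import numpy as np
import pandas as pd

# Rebuild the salary table from the slide as a labeled DataFrame
df = pd.DataFrame(
    {"salary": [50000, np.nan, 60000, -99999, 70000, 45000],
     "credit_score": [700, 670, np.nan, 750, 685, 660]},
    index=["Alice", "Bob", "Chris", "David", "Ella", "Tom"],
)

# Treat the -99999 sentinel as a missing value
df = df.replace(-99999, np.nan)
print(df.isna().sum())  # salary: 2 missing, credit_score: 1 missing
```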
Overview
• Created by Wes McKinney in 2008; now maintained by
many others.
• A powerful, productive Python library for data analysis and
management.
• It is an open-source project.
Overview
• A Python library providing data analysis features similar to
R, MATLAB, and SAS
• Rich data structures and functions that make working with
structured data fast, easy, and expressive
• Built on top of NumPy
• Key component provided by pandas:
• DataFrame
pandas.DataFrame
pandas.DataFrame(data, index, columns, dtype, copy)
• data
• Takes various forms: ndarray, Series, map, list, dict, constants, or another
DataFrame.
• index
• Labels for the rows. Optional; defaults to np.arange(n) if no index is
passed.
• columns
• Labels for the columns. Optional; defaults to np.arange(n) if no columns
argument is passed.
• dtype
• Data type of each column.
• copy
• Whether to copy data from the inputs; defaults to False.
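A small sketch of the different forms the data argument accepts (the column and index names here are illustrative):

```python
import numpy as np
import pandas as pd

# data as a dict of lists; index supplied explicitly
d1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r1", "r2"])

# data as an ndarray; dtype forces a common column type
d2 = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["x", "y"], dtype=float)

# data as a list of dicts; missing keys become NaN
d3 = pd.DataFrame([{"x": 1}, {"x": 2, "y": 3}])

print(d1.index.tolist())   # ['r1', 'r2'] — overrides the default np.arange(n)
print(str(d2.dtypes["x"])) # float64
```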
DataFrame Basics
• A DataFrame is a tabular data structure composed of rows and
columns, similar to a spreadsheet or database table.
• It can be treated as an ordered collection of columns
• Each column can have a different data type
• It has both row and column indices
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame)
#output
state year pop
0 Ohio 2000 1.5
1 Ohio 2001 1.7
2 Ohio 2002 3.6
3 Nevada 2001 2.4
4 Nevada 2002 2.9
DataFrame – specifying columns and indices
• Order of columns/rows can be specified.
• Columns not in data will have NaN.
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
D 2001 Nevada 2.4 NaN
E 2002 Nevada 2.9 NaN
Note: the columns appear in the specified order, and the 'debt'
column, absent from the data, is initialized with NaN.
DataFrame – index, columns, values
# frame3 can be built from a nested dict (outer keys become columns)
frame3 = DataFrame({'Nevada': {2001: 2.9, 2002: 2.9},
                    'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}})
frame3.index.name = 'year'
frame3.columns.name = 'state'
print(frame3)
state Nevada Ohio
year
2000 NaN 1.5
2001 2.9 1.7
2002 2.9 3.6
print(frame3.index)
Int64Index([2000, 2001, 2002], dtype='int64', name='year')
print(frame3.columns)
Index(['Nevada', 'Ohio'], dtype='object', name='state')
print(frame3.values)
[[nan 1.5]
[2.9 1.7]
[2.9 3.6]]
DataFrame – retrieving a column
• A column in a DataFrame can be retrieved as a Series by
dict-like notation or as attribute
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = DataFrame(data)
print(frame['state'])
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
Name: state, dtype: object
print(frame.state)
0 Ohio
1 Ohio
2 Ohio
3 Nevada
4 Nevada
Name: state, dtype: object
Activity 1
• Download the following CSV file and load it in your Python
script, or pass the URL directly to pd.read_csv(url), which
reads it into a DataFrame
• https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/values.csv
• Calculate the average and standard deviation (std) of the
column factor_1 and display the results.
• Use pandas mean() and std()
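One way the activity could be approached; the read_csv(url) call is shown commented out, and a tiny inline CSV with hypothetical values stands in for the real file:

```python
import io
import pandas as pd

# In the activity you would read the CSV straight from the URL:
# df = pd.read_csv("https://www.cs.odu.edu/~sampath/courses/f19/cs620/files/data/values.csv")
# Here a tiny inline CSV (hypothetical values) stands in for it.
csv_text = "factor_1,factor_2\n1.0,5\n2.0,6\n3.0,7\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df["factor_1"].mean())  # 2.0
print(df["factor_1"].std())   # 1.0 (sample std, ddof=1 by default)
```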
DataFrame – getting rows
• loc selects by label; iloc selects by integer position
• loc gets rows (or columns) with particular labels from the index.
• iloc gets rows (or columns) at particular integer positions in the index (so it takes only integers).
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'], index=['A', 'B', 'C', 'D', 'E'])
print(frame2)
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
D 2001 Nevada 2.4 NaN
E 2002 Nevada 2.9 NaN
print(frame2.loc['A'])
year 2000
state Ohio
pop 1.5
debt NaN
Name: A, dtype: object
print(frame2.loc[['A', 'B']])
year state pop debt
A 2000 Ohio 1.5 NaN
B 2001 Ohio 1.7 NaN
print(frame2.loc['A':'E',
['state','pop']])
state pop
A Ohio 1.5
B Ohio 1.7
C Ohio 3.6
D Nevada 2.4
E Nevada 2.9
print(frame2.iloc[1:3])
year state pop debt
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 NaN
print(frame2.iloc[:,1:3])
state pop
A Ohio 1.5
B Ohio 1.7
C Ohio 3.6
D Nevada 2.4
E Nevada 2.9
DataFrame – modifying columns
frame2['debt'] = 0
print(frame2)
year state pop debt
A 2000 Ohio 1.5 0
B 2001 Ohio 1.7 0
C 2002 Ohio 3.6 0
D 2001 Nevada 2.4 0
E 2002 Nevada 2.9 0
frame2['debt'] = range(5)
print(frame2)
year state pop debt
A 2000 Ohio 1.5 0
B 2001 Ohio 1.7 1
C 2002 Ohio 3.6 2
D 2001 Nevada 2.4 3
E 2002 Nevada 2.9 4
val = Series([10, 10, 10], index = ['A', 'C', 'D'])
frame2['debt'] = val
print(frame2)
year state pop debt
A 2000 Ohio 1.5 10.0
B 2001 Ohio 1.7 NaN
C 2002 Ohio 3.6 10.0
D 2001 Nevada 2.4 10.0
E 2002 Nevada 2.9 NaN
Rows or individual elements can be
modified similarly, using loc or iloc.
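A brief sketch of that, reusing the frame2 built above; the assigned values are illustrative:

```python
from pandas import DataFrame

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                   index=['A', 'B', 'C', 'D', 'E'])

# Modify a single element by label, and by position
frame2.loc['B', 'debt'] = 99.0
frame2.iloc[0, 3] = 11.0          # row 0 ('A'), column 3 ('debt')

# Replace an entire row by label
frame2.loc['E'] = [2003, 'Nevada', 3.0, 0.0]
print(frame2.loc['E', 'pop'])  # 3.0
```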
DataFrame – removing columns
del frame2['debt']
print(frame2)
year state pop
A 2000 Ohio 1.5
B 2001 Ohio 1.7
C 2002 Ohio 3.6
D 2001 Nevada 2.4
E 2002 Nevada 2.9
Removing rows/columns
print(frame)
c1 c2 c3
r1 0 1 2
r2 3 4 5
r3 6 7 8
print(frame.drop(['r1']))
c1 c2 c3
r2 3 4 5
r3 6 7 8
print(frame.drop(['r1','r3']))
c1 c2 c3
r2 3 4 5
print(frame.drop(['c1'], axis=1))
c2 c3
r1 1 2
r2 4 5
r3 7 8
This returns a new object
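A short sketch showing that drop leaves the original untouched unless you reassign (the frame matches the one above):

```python
import pandas as pd

frame = pd.DataFrame([[0, 1, 2], [3, 4, 5], [6, 7, 8]],
                     columns=['c1', 'c2', 'c3'], index=['r1', 'r2', 'r3'])

# drop returns a new DataFrame; the original is untouched
smaller = frame.drop(['r1'])
print('r1' in frame.index)    # True  — frame still has r1
print('r1' in smaller.index)  # False

# To change frame itself, reassign (or pass inplace=True)
frame = frame.drop(['c1'], axis=1)
print(frame.columns.tolist())  # ['c2', 'c3']
```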
Handling missing data
•Why Fill in the Missing Data?
It is necessary to fill in missing values in datasets, as most
machine learning models will raise an error if you pass NaN
values into them.
The easiest approach is to fill them with 0, but it is essential
to note that this can significantly reduce your model's
accuracy.
Handling missing data
•How to Know If the Data Has Missing Values?
Missing values are usually represented as NaN, null, or None
in the dataset.
• df.info() can be used to get information about the dataset,
including insight into missing data.
• It is one of the most used functions in data analysis: it shows
the column names and the number of non-null values in each
column.
• It also displays the data type of each column. From the
non-null counts we can see which columns contain nulls, and
from the data types we can decide what value to replace the
nulls with.
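A minimal sketch of df.info() on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing entry per column
df = pd.DataFrame({"Age": [22, np.nan, 30], "City": ["Pune", "Mumbai", None]})
df.info()
# The report lists 2 non-null values (out of 3 entries) for both
# Age and City, plus each column's dtype.
```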
Handling missing data
•How to Know If the Data Has Missing Values?
• The second way to find whether the data contains null values
is the isnull() function:
• print(df.isnull().sum())
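For example, on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one null in each column
df = pd.DataFrame({"Age": [22, np.nan, 30], "City": ["Pune", None, "Delhi"]})
print(df.isnull().sum())
# Age     1
# City    1
# dtype: int64
```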
Handling missing data
•Different Methods of Dealing With Missing Data
1. Deleting the column with missing data
updated_df = df.dropna(axis=1)
updated_df.info()
2. Deleting the row with missing data
If a row contains missing data, you can delete the entire row
with all the features in that row.
axis=1 drops columns with NaN values.
axis=0 drops rows with NaN values.
updated_df = newdf.dropna(axis=0)
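A quick sketch of both axes on a tiny hypothetical frame:

```python
import numpy as np
import pandas as pd

newdf = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, 6.0]})

# axis=1 drops any column containing NaN; axis=0 drops any row containing NaN
print(newdf.dropna(axis=1).columns.tolist())  # ['b']
print(len(newdf.dropna(axis=0)))              # 2 rows survive
```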
Handling missing data
•Different Methods of Dealing With Missing Data
3. Filling the Missing Values – Imputation
The possible ways to do this are:
• Fill the missing data with the mean or median if it is a
numerical variable.
• Fill the missing data with the mode if it is a categorical
variable.
• Fill the numerical value with 0, -999, or some other number
that does not occur in the data, so the model can recognize
that the value is not real.
• Fill the categorical value with a new category reserved for
missing values.
You can use the fillna() function to fill the null values in the dataset:
updated_df = df
updated_df['Age'] = updated_df['Age'].fillna(updated_df['Age'].mean())
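A sketch of mean and mode imputation together, on a small hypothetical frame (column names are illustrative):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [25, np.nan, 35, np.nan],
                   "Gender": ["F", "M", None, "M"]})

# Numerical column: fill with the mean (here (25 + 35) / 2 = 30)
df["Age"] = df["Age"].fillna(df["Age"].mean())

# Categorical column: fill with the mode (NaN is ignored when computing it)
df["Gender"] = df["Gender"].fillna(df["Gender"].mode()[0])

print(df["Age"].tolist())     # [25.0, 30.0, 35.0, 30.0]
print(df["Gender"].tolist())  # ['F', 'M', 'M', 'M']
```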
Editor's Notes
• #12: obj = pd.read_csv('values.csv')
• #15: df = frame.drop('r1', axis=0)  # axis=0 drops rows, axis=1 drops columns
• #17–#21: s.dropna(inplace=True)