Python Pandas DataFrame

Python Pandas DataFrame is a powerful tool for data manipulation and analysis. It provides a convenient way to store, manipulate, and analyze data in tabular form. In this article, we will explore the various functionalities of Pandas DataFrame, accompanied by detailed examples and explanations.

Note 1: This article is an extension to the main Pandas article below:


Note 2: We will be using Google Colaboratory Python notebooks to avoid setup and environment delays. The focus of this article is to get you up and running in Machine Learning with Python, and we can do all that we need there.


To begin, let's first understand what a Pandas DataFrame is. A DataFrame is a two-dimensional labeled data structure, similar to a table in a relational database or a spreadsheet. It consists of rows and columns, where each column can be of a different data type (integer, float, string, etc.). It offers a wide range of functions to work with the data, including methods for data alignment and handling missing values.


Creating a DataFrame:

Let's start with an example of creating a DataFrame from scratch. We can pass a dictionary of lists or arrays as an argument to the 'pd.DataFrame()' function. Each list corresponds to a column in the DataFrame, and the keys of the dictionary represent the column names. Here's an example:

import pandas as pd

data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
        'Age': [25, 32, 18, 27],
        'City': ['New York', 'London', 'Paris', 'Sydney']}

df = pd.DataFrame(data)

print(df)
        

Output:

    Name  Age      City
0   John   25  New York
1  Alice   32    London
2    Bob   18     Paris
3  Emily   27    Sydney
        

In the above example, we created a DataFrame with three columns: 'Name', 'Age', and 'City'. Each column is represented by a list of values. The row index is automatically generated by default, starting from 0.



For more details on all the ways to create Pandas DataFrame, please visit the following article:

Structure and Shape:

  • Rows and Columns: A DataFrame consists of rows and columns, where each row represents an individual data record, and each column represents a specific attribute or variable.
  • Shape: The shape of a DataFrame refers to the number of rows and columns it contains. You can access the shape of a DataFrame using the shape attribute.

Structure and Shape Example:

import pandas as pd

# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Emma', 'Robert'],
        'Age': [25, 28, 31],
        'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)

# Accessing the shape of the DataFrame
print("Shape of DataFrame:", df.shape)
        

Output:

     Name  Age      City
0    John   25  New York
1    Emma   28     Paris
2  Robert   31    London

Shape of DataFrame: (3, 3)
        




============

Data Manipulation:

DataFrame offers several attributes and functions to access, manipulate, and analyze the data. Let's explore a few of them:

We will be using the same DataFrame we created earlier.

Accessing columns:

We can access individual columns in a DataFrame using the column name as an index. For example:

# Selecting a single column
column = df['Name']

# Selecting multiple columns
columns = df[['Name', 'Age']]
        


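As a quick check (a minimal sketch using the df created above), note that a single column name returns a Series, while a list of column names returns a DataFrame:

# A single column name returns a Series
print(type(df['Name']))           # <class 'pandas.core.series.Series'>

# A list of column names returns a DataFrame
print(type(df[['Name', 'Age']]))  # <class 'pandas.core.frame.DataFrame'>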



Accessing rows:

We can access rows using the 'iloc[]' indexer, where 'iloc[row_index]' returns the row at the specified position. For example:

print(df.iloc[0])  # Access first row
        

Output:

Name         John
Age            25
City    New York
Name: 0, dtype: object
        
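Beyond single rows, 'iloc[]' also accepts slices and lists of positions, and 'loc[]' selects rows by index label. Here is a small sketch on the same DataFrame:

# Positional slicing: first two rows
print(df.iloc[0:2])

# A list of positions: first and last rows
print(df.iloc[[0, 3]])

# Label-based access with loc (the labels here are the default integers)
print(df.loc[0])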






Filtering data:

We can filter data based on specific conditions by using boolean indexing. For example, to filter the rows where the age is greater than 25:

filtered_df = df[df['Age'] > 25]
print(filtered_df)
        

Output:

    Name  Age    City
1  Alice   32  London
3  Emily   27  Sydney
        





Sorting data:

We can sort the DataFrame based on one or more columns using the 'sort_values()' function. For example, to sort the DataFrame in descending order of age:

sorted_df = df.sort_values('Age', ascending=False)
print(sorted_df)
        

Output:

    Name  Age      City
1  Alice   32    London
3  Emily   27    Sydney
0   John   25  New York
2    Bob   18     Paris
        

Sorting a Pandas DataFrame based on multiple columns

Sorting a Pandas DataFrame based on multiple columns can be done using the .sort_values() function. This function allows us to specify one or more columns by which we want to sort the DataFrame. Let's go through an example to understand how it works:

import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
        'Age': [25, 32, 18, 27],
        'City': ['New York', 'London', 'Paris', 'Sydney']}

df = pd.DataFrame(data)

# Sorting the DataFrame based on multiple columns
sorted_df = df.sort_values(['Age', 'Name'], ascending=[True, True])
print(sorted_df)
        

Output:

    Name  Age      City
2    Bob   18     Paris
0   John   25  New York
3  Emily   27    Sydney
1  Alice   32    London
        

In the above example, we created a DataFrame with three columns: 'Name', 'Age', and 'City'. To sort the DataFrame based on multiple columns, we passed a list of column names to the .sort_values() function. The first column in the list ('Age') will be the primary sorting key, and the second column ('Name') will be the secondary sorting key.

Additionally, we specified the ascending parameter to control the sort order for each column. By default, it is set to True, indicating ascending order. In this example, we set ascending=[True, True] to sort both the 'Age' column and the 'Name' column in ascending order.

The resulting DataFrame (sorted_df) is sorted based on the 'Age' column first and then sorted alphabetically based on the 'Name' column (in case of ties in the 'Age' column). This way, we can apply multiple sorting criteria to the DataFrame.

It's important to note that the .sort_values() function creates a new sorted DataFrame rather than modifying the original DataFrame. If you want to modify the original DataFrame, you can use the inplace=True parameter in the .sort_values() function.

By sorting a DataFrame based on multiple columns, we can order the data according to our desired criteria and gain insights from the sorted arrangement. This can be particularly useful when analyzing data with multiple sorting dimensions.
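The sort directions can also differ per column. For example, a short sketch sorting 'City' alphabetically while sorting 'Age' from oldest to youngest within each city:

sorted_df = df.sort_values(['City', 'Age'], ascending=[True, False])
print(sorted_df)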


Handling missing values:

Pandas DataFrame provides various functions to handle missing or NaN (Not a Number) values. We can use the 'fillna()' function to fill missing values with a specific value. For example, let's add a missing value to the 'City' column and fill it with 'Unknown':

# Use .loc to avoid chained-assignment issues when setting a value
df.loc[2, 'City'] = pd.NA
df['City'] = df['City'].fillna('Unknown')
print(df)
        

Output:

    Name  Age      City
0   John   25  New York
1  Alice   32    London
2    Bob   18   Unknown
3  Emily   27    Sydney
        

In the above example, we used the 'pd.NA' value to represent the missing value. Then, we used the 'fillna()' function to replace the missing value with 'Unknown'.
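Besides 'fillna()', Pandas offers 'isna()' to detect missing values and 'dropna()' to remove rows containing them. A minimal sketch:

# Count missing values per column
print(df.isna().sum())

# Drop any rows that still contain missing values (returns a new DataFrame)
df_clean = df.dropna()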


Grouping data:

# Grouping data by a column and calculating mean
grouped_data = df.groupby('City')['Age'].mean()
        




The groupby() function in pandas is a powerful tool used for grouping rows of a DataFrame based on the values in one or more columns. It's often used in conjunction with aggregation functions to perform calculations on these grouped data. The groupby() operation involves the following steps:

1. Splitting: The data is divided into groups based on one or more specified columns.

2. Applying: An aggregation function (or multiple aggregation functions) is applied to each group individually. These functions calculate summary statistics for each group.

3. Combining: The results of the aggregation functions are combined into a new DataFrame or Series, where each row corresponds to a unique combination of the grouping columns.

Here's the basic syntax of using groupby():

```python

grouped = df.groupby('column_name')

```

Where 'column_name' is the column by which you want to group the data.

Let's illustrate with an example. Assume you have the following DataFrame named sales:

```
  Product Category  Sales
0       A        X    100
1       B        Y    150
2       A        Y    200
3       C        X    120
4       B        X    180
5       C        Y    220
```

You can use groupby() to calculate the total sales for each product across categories:

import pandas as pd

data = {'Product': ['A', 'B', 'A', 'C', 'B', 'C'],
        'Category': ['X', 'Y', 'Y', 'X', 'X', 'Y'],
        'Sales': [100, 150, 200, 120, 180, 220]}

sales = pd.DataFrame(data)
print(sales)

grouped = sales.groupby('Product')
total_sales_per_product = grouped['Sales'].sum()
print(total_sales_per_product)


Output:

  Product Category  Sales
0       A        X    100
1       B        Y    150
2       A        Y    200
3       C        X    120
4       B        X    180
5       C        Y    220

Product
A    300
B    330
C    340
Name: Sales, dtype: int64


In this example, we're using groupby() to group the data by the 'Product' column. We then apply the sum() function to calculate the total sales for each product. The result is a new Series with the total sales for each product.

You can also group by multiple columns by passing a list of column names to groupby(). Additionally, you can use various aggregation functions like sum(), mean(), max(), min(), and more, to compute summary statistics for each group.

grouped = df.groupby(['column_name_1', 'column_name_2'])

result = grouped['column_to_aggregate'].agg(['sum', 'mean', 'max', 'min'])        

The agg() function allows you to specify multiple aggregation functions to be applied to the grouped data simultaneously.
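For instance, applying this to the sales DataFrame above (a small sketch grouping by 'Category' this time):

result = sales.groupby('Category')['Sales'].agg(['sum', 'mean'])
print(result)

Output:

          sum        mean
Category
X         400  133.333333
Y         570  190.000000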


Subsetting a DataFrame

To subset a DataFrame in Python, you can use various techniques to extract specific rows or columns based on specific conditions or criteria. Here are a few examples of how you can subset a DataFrame in Python, along with their outputs:

  1. Subsetting Rows Based on Condition:


import pandas as pd

# Create a DataFrame
data = {'Name': ['John', 'Emma', 'Adam', 'Emily'],
        'Age': [28, 32, 25, 30],
        'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)

# Subset rows where 'Age' is greater than 28
subset = df[df['Age'] > 28]
print(subset)

Output:

    Name  Age    City
1   Emma   32  London
3  Emily   30  Sydney

In the above example, we use the condition `df['Age'] > 28` to subset the DataFrame `df` and create a new DataFrame `subset` where only the rows with an 'Age' greater than 28 are selected.

You can also combine filters, for example on City and Age, as follows:

city_filter = df['City'] == 'London'
age_filter = df['Age'] > 25
print(df[city_filter & age_filter])


   Name  Age    City
1  Emma   32  London

Let's break down the code:

city_filter = df['City'] == 'London'
age_filter = df['Age'] > 25
print(df[city_filter & age_filter])
        

  1. city_filter = df['City'] == 'London': This line creates a boolean filter called city_filter which checks if the value in the 'City' column of the DataFrame df is equal to 'London'. It returns a boolean Series with True for rows where the condition is met and False for rows where it is not met.
  2. age_filter = df['Age'] > 25: This line creates another boolean filter called age_filter which checks if the value in the 'Age' column of the DataFrame df is greater than 25. It also returns a boolean Series with True for rows where the condition is met and False for rows where it is not met.
  3. print(df[city_filter & age_filter]): This line uses the boolean filters city_filter and age_filter to subset the DataFrame df. By using the & operator (bitwise AND operator), it creates a combined filter that only selects rows where both conditions are True. The resulting DataFrame is then printed.

So, the final output will be a subset of the original DataFrame df that contains rows where the 'City' column value is 'London' and the 'Age' column value is greater than 25.

For multiple categorical values, you can use .isin() method as follows:

print(df[df['City'].isin(['London', 'Sydney'])])        


Output:

    Name  Age    City
1   Emma   32  London
3  Emily   30  Sydney


2. Subsetting Columns:

Subsetting columns from a DataFrame involves selecting specific columns or excluding certain columns from the DataFrame. Here are a few techniques and examples to help you understand subsetting columns in Python:

Before we start and to be aligned on the example, please create the following DataFrame:

import pandas as pd

# Create a DataFrame
data = {'Name': ['John', 'Emma', 'Adam', 'Emily'],
        'Age': [28, 32, 25, 30],
        'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)

  • Subsetting one Column

# Subset only the 'Name' column
subset = df['Name']

print(subset)        


Output:

0     John
1     Emma
2     Adam
3    Emily
Name: Name, dtype: object

  • Subsetting more than one column




# Subset only the 'Name' and 'City' columns
subset = df[['Name', 'City']]

print(subset)
        

Output:

    Name      City
0   John  New York
1   Emma    London
2   Adam     Paris
3  Emily    Sydney
        

In this example, we subset the DataFrame df and create a new DataFrame subset that consists of only the 'Name' and 'City' columns.

subset = df[['Name', 'City']]

In the line of code: subset = df[['Name', 'City']], two pairs of square brackets are used to subset the DataFrame df and select specific columns.

The outer pair of square brackets, [], is used to slice or subset the DataFrame by columns. Within these brackets, we pass a list of column names ['Name', 'City'] to indicate the specific columns we want to select.

The inner pair of square brackets is used to define a list literal in Python. It encloses the column names as list elements.

By using two pairs of square brackets, we are effectively passing a list of column names as an argument to the DataFrame's indexing operator. This syntax allows us to select multiple columns from the DataFrame and create a new subset DataFrame with only the specified columns.




  3. Subsetting Rows and Columns:

import pandas as pd

data = {'Name': ['John', 'Emma', 'Adam', 'Emily'],
        'Age': [28, 32, 25, 30],
        'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)

subset = df.loc[df['Age'] > 28, ['Name', 'City']]
print(subset)


Output:

    Name    City
1   Emma  London
3  Emily  Sydney

In this example, we use the loc function to subset both rows based on the condition df['Age'] > 28 and columns ['Name', 'City']. The resulting DataFrame subset includes only the selected rows and columns. These are just a few examples of subsetting a DataFrame in Python. Depending on your specific needs, you can combine various conditions and techniques to extract the desired data from a DataFrame.
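If you prefer positional rather than label-based selection, 'iloc[]' supports the same row-and-column pattern. A small sketch on the same df:

# First two rows (by position) and the 'Name' and 'City' columns (positions 0 and 2)
subset = df.iloc[0:2, [0, 2]]
print(subset)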





Data Visualization:

Pandas integrates well with other visualization libraries like Matplotlib and Seaborn to create insightful plots and charts.

Plotting a line chart:

df.plot(x='Year', y='Sales', kind='line')
        

Creating a bar chart:

df.plot(x='Country', y='Population', kind='bar')
        
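The two snippets above assume a DataFrame that already contains the referenced columns. Here is a minimal self-contained sketch, with made-up numbers, and assuming Matplotlib is installed (as it is in Colab):

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical yearly sales figures, for illustration only
df_sales = pd.DataFrame({'Year': [2019, 2020, 2021, 2022],
                         'Sales': [100, 150, 130, 170]})

df_sales.plot(x='Year', y='Sales', kind='line')
plt.show()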


Loading Data:

Pandas allows you to load data from various sources such as CSV, Excel, SQL databases, and more.

Loading CSV data:

data = pd.read_csv('data.csv')
        

Let's load one of the sample files in Google Colab by following this article:

If you followed the article above, you will be able to continue with the following code:

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("california_housing_train.csv")

# Display the first few rows of the DataFrame to verify the data load
df.head()





Loading Excel data:

data = pd.read_excel('data.xlsx')
        

Loading JSON data:

Pandas provides a function called pd.read_json() that allows you to read JSON data and convert it into a Series or DataFrame.

Here's an example of creating a Pandas Series from a JSON file:

Suppose we have a JSON file called data.json with the following content:

{
 "A": 10,
 "B": 20,
 "C": 30,
 "D": 40,
 "E": 50
}
        

To create a Pandas Series from this JSON file, we can use the pd.read_json() function as follows:

import pandas as pd

# Read JSON file and create a Series
series_from_json = pd.read_json('data.json', typ='series')
print(series_from_json)
        

Output:

A    10
B    20
C    30
D    40
E    50
dtype: int64
        

In the example above, we import the pandas library as pd. We then use the pd.read_json() function and pass the file path of the JSON file ('data.json') as the first argument. Additionally, we specify the typ parameter as 'series' to indicate that we want to create a Series from the JSON data.

The resulting Series series_from_json will have the keys from the JSON file as the index labels and the corresponding values as the values of the Series.

Creating a Pandas Series from a JSON file comes in handy when you have JSON data that you want to analyze and manipulate using the rich functionalities of Pandas. It allows you to easily read and transform JSON data into a structured format for further data analysis tasks.
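JSON that represents records (a list of objects) can also be loaded as a DataFrame. Here is a small sketch with an inline JSON string, wrapped in StringIO since newer Pandas versions deprecate passing literal JSON strings directly:

import io
import pandas as pd

json_records = '[{"Name": "John", "Age": 25}, {"Name": "Alice", "Age": 32}]'
df_from_json = pd.read_json(io.StringIO(json_records))
print(df_from_json)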


Exporting Data:

Pandas allows you to export data to various formats, including CSV and Excel.

Exporting data to CSV:

df.to_csv('output.csv', index=False)
        

Exporting data to Excel:

df.to_excel('output.xlsx', index=False)
        
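As a quick round-trip sketch, exporting with index=False and reading the file back yields the same columns; without it, the index would be written as an extra unnamed column:

df.to_csv('output.csv', index=False)

# Reading the file back reproduces the original columns
df_reloaded = pd.read_csv('output.csv')
print(df_reloaded.columns)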

===========

Working with Large Data

Since we learned how to load and export data from files, let's explore data manipulation in detail using the file california_housing_train.csv that we loaded previously.

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("california_housing_train.csv")
        

Exploring a DataFrame: .head()

The .head() function is a useful method in Pandas DataFrame that allows us to quickly examine the first few rows of our data. It helps us get a glimpse of the structure and contents of the DataFrame.

The .head() function returns the top n rows of the DataFrame, where n is an optional parameter. By default, it displays the first 5 rows. Let's explore a few examples to understand it better:

print(df.head())

By default, .head() displays the first 5 rows of the DataFrame. It provides a quick summary of the data, including the column names and a few rows.

We can also pass an integer value as a parameter to the .head() function to specify the number of rows we want to display. For example, to display the first 3 rows:

print(df.head(3))
        


In this case, the .head(3) function displays the first 3 rows of the DataFrame. It allows us to control the number of rows we want to see.

The .head() function is particularly helpful when dealing with large datasets, as it allows us to check the data quickly without having to display the entire DataFrame. It provides a concise overview before diving into further analysis or manipulation.

In addition to the .head() function, Pandas also offers the .tail() function, which works in a similar way but displays the last n rows of the DataFrame. By default, it shows the last 5 rows. This function can be useful when we want to check the last few records in our data.
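For example, a quick sketch of .tail() on the same DataFrame:

print(df.tail())   # last 5 rows
print(df.tail(3))  # last 3 rows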

Overall, the .head() function is a handy method for exploring the initial rows of a Pandas DataFrame, providing a snapshot of the data's structure and contents. It helps us get a quick understanding of the DataFrame before delving into further analysis or processing.

Note: The .head() function does not allow us to limit the displayed columns. It only controls the number of rows to be displayed from the DataFrame (5 by default). However, if you want to limit the displayed columns, you can achieve that by selecting specific columns from the DataFrame before using the .head() function. Here's an example:


# Select specific columns first, then take the first rows
# (e.g. two of the columns from california_housing_train.csv)
print(df[['longitude', 'latitude']].head())

Alternatively, if you want to limit the displayed columns after using .head(), you can use column indexing or the .loc[] function to select specific columns. Here's an example:


# Take the first rows, then select specific columns
print(df.head()[['longitude', 'latitude']])

# Or, using .loc[]
print(df.head().loc[:, ['longitude', 'latitude']])

Exploring a DataFrame with .info()

The .info() function is a useful method in Pandas DataFrame that provides a concise summary of the DataFrame's structure, including information about the columns, data types, and memory usage. It gives us a quick overview of the DataFrame's metadata and can be particularly helpful when dealing with large datasets. Let's explore how to use it:

import pandas as pd

# Creating a sample DataFrame
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
       'Age': [25, 32, 18, 27],
       'City': ['New York', 'London', 'Paris', 'Sydney']}

df = pd.DataFrame(data)

# Using .info() to get DataFrame summary
print(df.info())
        

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
#   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
0   Name    4 non-null      object
1   Age     4 non-null      int64 
2   City    4 non-null      object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
None
        

In the above example, we created a DataFrame with three columns: 'Name', 'Age', and 'City'. By applying the .info() function to the DataFrame, we get a structured summary of the DataFrame's metadata:

  1. The first line displays the class of the DataFrame (<class 'pandas.core.frame.DataFrame'>) and the RangeIndex, which indicates the index range of the DataFrame (in this case, from 0 to 3).
  2. The "Data columns" section provides information about each column in the DataFrame. It displays the column name, the number of non-null values present in the column, and the data type of the column.
  3. The "dtypes" line gives a summary of the data types present in the DataFrame. In this example, 'Name' and 'City' columns are of type 'object' (string) and the 'Age' column is of type 'int64' (integer).
  4. The last line shows the memory usage of the DataFrame, which can be helpful to understand the memory footprint of the DataFrame.

The .info() function gives us valuable information about the structure and composition of the DataFrame. It helps us identify potential data issues, such as missing values or incorrect data types. Additionally, it allows us to assess the memory usage of the DataFrame and make optimizations if needed.

Overall, the .info() function is a handy tool for quickly examining the metadata of a Pandas DataFrame, providing insights about column names, data types, missing values, and memory usage. It is an essential step in understanding and preprocessing the data before performing further analysis or manipulation.
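For example, a small sketch with a deliberately missing value shows how the Non-Null Count exposes it:

import pandas as pd

df_missing = pd.DataFrame({'Name': ['John', 'Alice', None],
                           'Age': [25, 32, 18]})

# 'Name' reports 2 non-null out of 3 entries, flagging the missing value
df_missing.info()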

Here's another example of how we can read the CSV file named "california_housing_train.csv" and explore the DataFrame using .info():

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("california_housing_train.csv")

# Use .info() to get DataFrame summary
print(df.info())
        

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
#   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
0   longitude           17000 non-null  float64
1   latitude            17000 non-null  float64
2   housing_median_age  17000 non-null  float64
3   total_rooms         17000 non-null  float64
4   total_bedrooms      17000 non-null  float64
5   population          17000 non-null  float64
6   households          17000 non-null  float64
7   median_income       17000 non-null  float64
8   median_house_value  17000 non-null  float64
dtypes: float64(9)
memory usage: 1.2 MB
None
        

In this example, we read the "california_housing_train.csv" file using the pd.read_csv() function and stored it in the DataFrame df. Then, we applied the .info() function to get a summary of the DataFrame:

  1. The first line of the output shows the class of the DataFrame (<class 'pandas.core.frame.DataFrame'>) and the RangeIndex, indicating the index range of the DataFrame (from 0 to 16999, denoting 17000 rows).
  2. The "Data columns" section lists the column names and provides information about each column. It includes the column name, the number of non-null values present in the column, and the data type of the column.
  3. The "dtypes" line summarizes the data types of the columns. In this case, all the columns are of type float64, indicating numeric data.
  4. The last line displays the memory usage of the DataFrame, which helps us understand the memory footprint of the DataFrame.

Exploring Columns with df.columns

In Pandas, df.columns is an attribute of a DataFrame df that allows you to access the column labels (column names) of the DataFrame. It returns an Index object containing the names of all the columns in the DataFrame. You can think of it as a list of column names.

Here's how you can use df.columns:

import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35],
        'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)

# Access the column names
column_names = df.columns

# Print the column names
print(column_names)

Output:


Index(['Name', 'Age', 'City'], dtype='object')        


In the example above, df.columns returns an Index object containing the column names 'Name', 'Age', and 'City'. You can use these column names for various DataFrame operations, such as selecting specific columns, renaming columns, or performing calculations on specific columns.
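For example, a short sketch that uses a current column name to rename it:

# Rename a single column via its current name
df = df.rename(columns={'Name': 'FullName'})
print(df.columns)  # Index(['FullName', 'Age', 'City'], dtype='object')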


Exploring Data with df.describe()

The df.describe() method in Pandas is used to generate descriptive statistics of a DataFrame. It provides summary statistics for the numerical columns in the DataFrame, such as count, mean, standard deviation, minimum, and maximum values. This method is particularly useful for getting a quick overview of your data's central tendencies and dispersion.

Here's how to use df.describe():

import pandas as pd

# Create a DataFrame
data = {'Age': [25, 30, 35, 40, 45],
        'Salary': [50000, 60000, 75000, 90000, 80000]}
df = pd.DataFrame(data)

# Use df.describe()
summary = df.describe()

# Print the summary statistics
print(summary)
        


Output:

             Age        Salary
count   5.000000      5.000000
mean   35.000000  71000.000000
std     7.905694  15968.719423
min    25.000000  50000.000000
25%    30.000000  60000.000000
50%    35.000000  75000.000000
75%    40.000000  80000.000000
max    45.000000  90000.000000
        

The output includes the following statistics for each numerical column:

  • count: The number of non-null values in the column.
  • mean: The mean (average) value of the column.
  • std: The standard deviation, which measures the spread or dispersion of the data.
  • min: The minimum value in the column.
  • 25%: The 25th percentile value (also known as the first quartile).
  • 50%: The 50th percentile value (also known as the median or second quartile).
  • 75%: The 75th percentile value (also known as the third quartile).
  • max: The maximum value in the column.

These statistics help you understand the central tendency, spread, and distribution of the data. For example, in the output above, you can see that the 'Age' column has a mean of 35 years, a standard deviation of approximately 7.91, and ranges from 25 to 45. Similarly, the 'Salary' column has a mean of $71,000, a standard deviation of approximately $15,969, and ranges from $50,000 to $90,000.

Here are the results for the file "/content/sample_data/california_housing_train.csv":

import pandas as pd

# Load the CSV file into a DataFrame
df = pd.read_csv("/content/sample_data/california_housing_train.csv")

df.describe()



df.transpose() (swap rows and columns)


Let's use it with the previous example to transpose the .describe() output:

df.describe().transpose()


The df.transpose() method in Pandas is used to transpose (swap rows and columns) a DataFrame. This operation flips the DataFrame along its diagonal axis. It can be particularly useful when you want to interchange rows and columns for various reasons, such as changing the orientation of your data for better analysis or visualization.

Here's how to use df.transpose():


import pandas as pd

# Create a DataFrame
data = {'A': [1, 2, 3],
        'B': [4, 5, 6],
        'C': [7, 8, 9]}
df = pd.DataFrame(data)

# Transpose the DataFrame
transposed_df = df.transpose()

# Print the transposed DataFrame
print(transposed_df)


Original DataFrame:

```
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9
```

Transposed DataFrame:

```
   0  1  2
A  1  2  3
B  4  5  6
C  7  8  9
```

As you can see, the rows of the original DataFrame become columns in the transposed DataFrame, and vice versa. The column labels in the original DataFrame become the index labels in the transposed DataFrame.

You might use df.transpose() when you want to perform operations or analysis on the data that are more naturally expressed with the rows and columns interchanged. For instance, it can be helpful when the original DataFrame has features (columns) as rows, and you want to analyze them as columns.

Keep in mind that df.transpose() returns a new DataFrame with the transposed data, and it doesn't modify the original DataFrame in place. If you want to update the original DataFrame, you can assign the result back to it:

```python

df = df.transpose()  # Assign the transposed DataFrame back to df        

```

Additionally, you can use the .T attribute as a shorthand for transposing a DataFrame:

```python

transposed_df = df.T  # Equivalent to df.transpose()

```


Dropping a row from a DataFrame

To drop a row by index from a DataFrame in pandas, you can use the drop() function with the index parameter set to the desired row index. Here's an example:

import pandas as pd

# Assuming you have a DataFrame called 'df'

# Drop a row by index
df = df.drop(index=3)

# Print the updated DataFrame
print(df)
        

In the code above, df.drop(index=3) drops the row with index 3 from the DataFrame df. The drop() function is called on the DataFrame object and the index parameter is set to the index value you want to remove. The resulting DataFrame is then assigned back to df. Finally, print(df) is used to display the updated DataFrame after dropping the row.

Make sure to replace 'df' with the actual name of your DataFrame. Additionally, you can remove multiple rows by passing a list of index values to the index parameter, like index=[3, 5, 7], to drop multiple rows at once.
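For example, a quick sketch of dropping several rows at once (assuming those index labels exist in your DataFrame):

# Drop the rows with index labels 3, 5, and 7
df = df.drop(index=[3, 5, 7])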


Dropping columns from a DataFrame

To drop columns from a DataFrame in Pandas, you can use the .drop() method. Here is an example using the provided DataFrame:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("/content/sample_data/california_housing_train.csv")

# Drop columns using the .drop() method
df.drop(['column_name_1', 'column_name_2'], axis=1, inplace=True)

# The 'column_name_1' and 'column_name_2' columns have been dropped from the DataFrame
        

In the example above, replace 'column_name_1' and 'column_name_2' with the actual names of the columns you want to drop from the DataFrame. The axis=1 argument specifies that the columns should be dropped, and inplace=True ensures that the changes are made directly in the DataFrame.


Let's drop some columns from the df we have loaded from california_housing_train.csv:

import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv("/content/sample_data/california_housing_train.csv")
print(df.head())

        

Let's drop some of the columns:

# Drop columns
df.drop(['median_house_value',
         'median_income', 'households', 'population',
         'housing_median_age'], axis=1, inplace=True)

# Print the result
print(df.head())



Resetting the index

In a Pandas DataFrame, the index refers to the labels assigned to each row. It provides a unique identifier for each row of data, which helps in data alignment and indexing operations.

Resetting the index means reassigning the default numerical index (0, 1, 2, ...) to the DataFrame. This can be useful in scenarios where you have modified the DataFrame and want to discard the current index, or when you want to convert an indexed DataFrame into a simple numerical index.

There are several methods to reset the index of a DataFrame:

  1. reset_index() method: This method resets the index and moves the old index into a new column called 'index'. A new numerical index is assigned: df.reset_index(inplace=True)
  2. reset_index(drop=True) method: This method resets the index without adding a new column. It discards the old index and assigns a new numerical index to the DataFrame: df.reset_index(drop=True, inplace=True)
  3. set_index() method: Alternatively, you can assign a fresh numerical index directly, for example df.set_index(np.arange(len(df)), inplace=True) (this requires import numpy as np).

Note: In all examples, inplace=True is used to modify the DataFrame directly. If you prefer to create a new DataFrame instead of modifying the original one, you can omit inplace=True.
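Here is a minimal self-contained sketch of the difference, using a small throwaway DataFrame:

import pandas as pd

df_demo = pd.DataFrame({'A': [10, 20, 30, 40]})
df_demo = df_demo.drop(index=1)  # index labels are now 0, 2, 3

# Discard the old labels and assign a fresh 0, 1, 2 index
df_demo.reset_index(drop=True, inplace=True)
print(df_demo)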

Let's drop the row at index 2 and see what happens to the data:

df.drop(index=2, inplace=True)

print(df.head())

        


Notice that index 2 is gone from the output. Now, let's reset the index:

df.reset_index(inplace=True)

print(df.head())        



In this case, it kept the old index as a new 'index' column. Let's drop index 3 and reset with the drop option. We will also drop the old 'index' column to avoid confusion:

df.drop('index', axis=1, inplace=True)
df.drop(index=3,inplace=True)
print(df.head())

df.reset_index(drop=True, inplace=True)
print(df.head())        




Data Aggregation

Aggregation refers to the process of combining multiple values into a single value, typically by applying a mathematical or statistical function. In the context of pandas DataFrames, aggregation involves grouping data based on certain criteria and performing calculations on those grouped subsets.

Here are some key points to understand about aggregation in pandas:

  1. Grouping: Aggregation often starts with grouping the data based on one or more columns. The groupby() function in pandas is used to create a groupby object, which allows you to group the data based on specific criteria, such as unique values in a column.
  2. Aggregation Functions: After grouping the data, one or more aggregation functions can be applied to calculate summary statistics or perform calculations on the grouped subsets. Common aggregation functions include sum(), mean(), min(), max(), count(), median(), std(), var(), etc.
  3. Specifying Aggregations: The agg() function is used to specify the aggregation functions to apply to specific columns. It takes a dictionary as input, where the keys represent the columns and the values represent the corresponding aggregation functions.
  4. Multiple Aggregations: It is possible to apply multiple aggregation functions to the same column or different aggregation functions to different columns. This can be done by providing a list of aggregation functions as the value in the dictionary passed to agg().
  5. Renaming Columns: By default, the resulting DataFrame after aggregation may have multi-level column names. You can use the rename() function to rename the columns to more meaningful names using a dictionary or by directly assigning new column names.
  6. Resetting Index: After performing aggregation, the resulting DataFrame may have a hierarchical index if the grouping was done on multiple columns. The reset_index() function can be used to convert the index back into regular columns.

Aggregation is a powerful technique for summarizing and analyzing data. It allows you to condense large datasets into meaningful insights by calculating summary statistics, identifying patterns, or deriving new features. pandas provides a simple and efficient way to perform aggregation operations on DataFrames, making it an essential tool for data analysis and manipulation.

Example:

Using the following data, calculate the total revenue and average price of items sold in each store:

import pandas as pd

data = {
    'Store': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Item': ['Mango', 'Banana', 'Orange', 'Grape', 'Mango', 'Banana', 'Orange', 'Grape'],
    'Price': [60, 30, 20, 70, 45, 11, 60, 30],
    'Quantity': [20, 10, 40, 30, 50, 20, 40, 45]
}

df = pd.DataFrame(data)
print(df)


Output:

  Store    Item  Price  Quantity
0     A   Mango     60        20
1     B  Banana     30        10
2     A  Orange     20        40
3     B   Grape     70        30
4     A   Mango     45        50
5     B  Banana     11        20
6     A  Orange     60        40
7     B   Grape     30        45

Let's calculate the total revenue and average price of items sold in each store using the provided data:

  1. Import the necessary libraries: import pandas as pd
  2. Define the data: data = {'Store': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'], 'Item': ['Mango', 'Banana', 'Orange', 'Grape', 'Mango', 'Banana', 'Orange', 'Grape'], 'Price': [60, 30, 20, 70, 45, 11, 60, 30], 'Quantity': [20, 10, 40, 30, 50, 20, 40, 45]}
  3. Create a DataFrame: df = pd.DataFrame(data)
  4. Calculate the revenue for each row by multiplying 'Price' and 'Quantity': df['Revenue'] = df['Price'] * df['Quantity']
  5. Group the data by the 'Store' column and calculate the total revenue and average price using the groupby() and agg() functions: store_summary = df.groupby('Store').agg({'Revenue': 'sum', 'Price': 'mean'}).reset_index()
  6. Rename the columns for clarity using the rename() function: store_summary.rename(columns={'Revenue': 'Total Revenue', 'Price': 'Average Price'}, inplace=True)
  7. Print the store summary DataFrame: print(store_summary)

The output will be:

  Store  Total Revenue  Average Price
0     A           6650          46.25
1     B           3970          35.25
        

This code calculates the total revenue and average price of items sold in each store using pandas DataFrame and aggregation functions. By grouping the data by the 'Store' column and applying the sum() and mean() aggregation functions, we obtain the desired results.

import pandas as pd

data = {
    'Store': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
    'Item': ['Mango', 'Banana', 'Orange', 'Grape', 'Mango', 'Banana', 'Orange', 'Grape'],
    'Price': [60, 30, 20, 70, 45, 11, 60, 30],
    'Quantity': [20, 10, 40, 30, 50, 20, 40, 45]
}

df = pd.DataFrame(data)
df['Revenue'] = df['Price'] * df['Quantity']

store_summary = df.groupby('Store').agg({'Revenue': 'sum', 'Price': 'mean'}).reset_index()

store_summary.rename(columns={'Price': 'Average_Price'}, inplace=True)

print("Store Summary:\n", store_summary)

This code demonstrates how to perform aggregation and data processing on a DataFrame using the pandas library in Python. Let's break down the code step by step:

  1. The code begins by importing the pandas library using the import statement.
  2. A dictionary called data is defined, which contains the data that will be used to create the DataFrame. The dictionary consists of four key-value pairs, where each key represents a column name and the corresponding value represents the data for that column.
  3. The dictionary is then used to create a DataFrame df using the pd.DataFrame() function from pandas.
  4. The code adds a new column called 'Revenue' to the DataFrame by multiplying the 'Price' column with the 'Quantity' column.




  5. The DataFrame is then grouped by the 'Store' column using the groupby() function. This groups the data based on the unique values in the 'Store' column.
  6. The agg() function is applied to the grouped DataFrame to perform aggregation operations on the specified columns. In this case, 'Revenue' is summed using the 'sum' aggregation function, and 'Price' is averaged using the 'mean' aggregation function.




  7. The reset_index() function is used to reset the index of the resulting DataFrame, making 'Store' a regular column instead of an index.




  8. The 'Price' column is renamed to 'Average_Price' using the rename() function.
  9. Finally, the resulting store summary DataFrame is printed using the print() function.


Output:

Store Summary:
   Store  Revenue  Average_Price
0     A     6650          46.25
1     B     3970          35.25


Additional examples:

  1. Aggregating based on multiple columns: Let's say we want to calculate the total revenue and average price for each store and item combination. We can modify the code as follows:

store_item_summary = df.groupby(['Store', 'Item']).agg({'Revenue': 'sum', 'Price': 'mean'}).reset_index()
store_item_summary.rename(columns={'Price': 'Average_Price'}, inplace=True)
print("Store-Item Summary:\n", store_item_summary)
        

  2. Applying multiple aggregation functions: If we need to calculate various statistics for each store, such as total revenue, minimum price, and maximum quantity, we can use a dictionary with multiple aggregation functions in the agg() method:

store_stats = df.groupby('Store').agg({'Revenue': 'sum', 'Price': ['min', 'max'], 'Quantity': 'max'}).reset_index()
store_stats.columns = ['Store', 'Total_Revenue', 'Min_Price', 'Max_Price', 'Max_Quantity']
print("Store Statistics:\n", store_stats)
        


Output:

Store Statistics:
   Store  Total_Revenue  Min_Price  Max_Price  Max_Quantity
0     A           6650         20         60            50
1     B           3970         11         70            45


These examples demonstrate how to aggregate and process data in a DataFrame using pandas. Aggregation functions like sum, mean, min, max, etc., can be used based on the requirements to derive meaningful insights from the data.

Example 2:

Using the following data, calculate the total amount spent by each customer:

data = {
    'Customer': ['Rany', 'Reem', 'Rany', 'Rany', 'Reem', 'Reem', 'Rany', 'Reem'],
    'Item': ['iPad', 'iPhone', 'Monitor', 'Mouse', 'iPad', 'iPhone', 'Monitor', 'Mouse'],
    'Price': [10, 5, 50, 20, 10, 5, 50, 20],
    'Quantity': [3, 4, 2, 5, 10, 6, 1, 2]
}

  1. Convert the data to a DataFrame:

import pandas as pd

df = pd.DataFrame(data)
print(df)

  2. We create a 'total_spent' column:

df['total_spent'] = df['Price'] * df['Quantity']
print(df)        


Output:

  Customer     Item  Price  Quantity  total_spent
0     Rany     iPad     10         3           30
1     Reem   iPhone      5         4           20
2     Rany  Monitor     50         2          100
3     Rany    Mouse     20         5          100
4     Reem     iPad     10        10          100
5     Reem   iPhone      5         6           30
6     Rany  Monitor     50         1           50
7     Reem    Mouse     20         2           40


  3. We group by customer and sum the total spent:

customer_summary = df.groupby('Customer')['total_spent'].sum().reset_index()
print(customer_summary)

Output:

  Customer  total_spent
0     Rany          280
1     Reem          190

You can also use the agg() function as follows:

customer_summary = df.groupby('Customer').agg({'total_spent': 'sum'}).reset_index()        


Conclusion:

In conclusion, Python Pandas DataFrame is a powerful tool for managing and analyzing data in a tabular form. It allows easy access, manipulation, and analysis of data, providing numerous functions and methods. With its rich functionality and ease of use, Pandas DataFrame is widely used in data science and analysis tasks.


Pandas DataFrame Cheat Sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf



