Python Pandas DataFrame
Python Pandas DataFrame is a powerful tool for data manipulation and analysis. It provides a convenient way to store, manipulate, and analyze data in tabular form. In this article, we will explore the various functionalities of Pandas DataFrame, accompanied by detailed examples and explanations.
Note 1: This article is an extension of the main Pandas article.
Note 2: We will be using Google Colaboratory Python notebooks to avoid setup and environment delays. The focus of this article is to get you up and running in Machine Learning with Python, and we can do all that we need there.
To begin, let's first understand what a Pandas DataFrame is. A DataFrame is a two-dimensional labeled data structure, similar to a table in a relational database or a spreadsheet. It consists of rows and columns, where each column can be of a different data type (integer, float, string, etc.). It offers a wide range of functions to work with the data, including methods for data alignment and handling missing values.
Creating a DataFrame:
Let's start with an example of creating a DataFrame from scratch. We can pass a dictionary of lists or arrays as an argument to the 'pd.DataFrame()' function. Each list corresponds to a column in the DataFrame, and the keys of the dictionary represent the column names. Here's an example:
import pandas as pd
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
'Age': [25, 32, 18, 27],
'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)
print(df)
Output:
Name Age City
0 John 25 New York
1 Alice 32 London
2 Bob 18 Paris
3 Emily 27 Sydney
In the above example, we created a DataFrame with three columns: 'Name', 'Age', and 'City'. Each column is represented by a list of values. The row index is automatically generated by default, starting from 0.
For more details on all the ways to create a Pandas DataFrame, please see the dedicated article on that topic.
Structure and Shape:
The .shape attribute returns a tuple of (number of rows, number of columns). Here's an example:
import pandas as pd
# Creating a DataFrame from a dictionary
data = {'Name': ['John', 'Emma', 'Robert'],
'Age': [25, 28, 31],
'City': ['New York', 'Paris', 'London']}
df = pd.DataFrame(data)
print(df)
# Accessing the shape of the DataFrame
print("Shape of DataFrame:", df.shape)
Output:
Name Age City
0 John 25 New York
1 Emma 28 Paris
2 Robert 31 London
Shape of DataFrame: (3, 3)
Data Manipulation:
DataFrame offers several attributes and functions to access, manipulate, and analyze the data. Let's explore a few of them:
We will be using the same DataFrame we created earlier.
Accessing columns:
We can access individual columns in a DataFrame using the column name as an index. For example:
# Selecting a single column
column = df['Name']
# Selecting multiple columns
columns = df[['Name', 'Age']]
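A detail worth knowing: single brackets return a one-dimensional Series, while double brackets return a DataFrame, even for a single column. A minimal sketch, using the sample data from above:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emily'],
                   'Age': [25, 32, 18, 27],
                   'City': ['New York', 'London', 'Paris', 'Sydney']})

# Single brackets return a one-dimensional Series
print(type(df['Name']))    # <class 'pandas.core.series.Series'>

# Double brackets return a DataFrame, even for one column
print(type(df[['Name']]))  # <class 'pandas.core.frame.DataFrame'>
```

This distinction matters when a later step expects a two-dimensional DataFrame rather than a Series.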
Accessing rows:
We can access rows by position using the 'iloc[]' indexer, where 'iloc[row_index]' returns the row at the specified position. For example:
print(df.iloc[0]) # Access first row
Output:
Name John
Age 25
City New York
Name: 0, dtype: object
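Note that iloc[] selects by position, while its counterpart loc[] selects by index label. The two coincide on a fresh DataFrame but diverge once rows have been filtered or dropped. A short sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emily'],
                   'Age': [25, 32, 18, 27],
                   'City': ['New York', 'London', 'Paris', 'Sydney']})

# After filtering, only the rows with index labels 1 and 3 remain
adults = df[df['Age'] > 25]

print(adults.iloc[0]['Name'])  # Alice -- first row by position
print(adults.loc[3, 'Name'])   # Emily -- row whose index label is 3
```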
Filtering data:
We can filter data based on specific conditions by using boolean indexing. For example, to filter the rows where the age is greater than 25:
filtered_df = df[df['Age'] > 25]
print(filtered_df)
Output:
Name Age City
1 Alice 32 London
3 Emily 27 Sydney
Sorting data:
We can sort the DataFrame based on one or more columns using the 'sort_values()' function. For example, to sort the DataFrame in descending order of age:
sorted_df = df.sort_values('Age', ascending=False)
print(sorted_df)
Output:
Name Age City
1 Alice 32 London
3 Emily 27 Sydney
0 John 25 New York
2 Bob 18 Paris
Sorting a Pandas DataFrame based on multiple columns
Sorting a Pandas DataFrame based on multiple columns can be done using the .sort_values() function. This function allows us to specify one or more columns by which we want to sort the DataFrame. Let's go through an example to understand how it works:
import pandas as pd
# Creating a sample DataFrame
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
'Age': [25, 32, 18, 27],
'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)
# Sorting the DataFrame based on multiple columns
sorted_df = df.sort_values(['Age', 'Name'], ascending=[True, True])
print(sorted_df)
Output:
Name Age City
2 Bob 18 Paris
0 John 25 New York
3 Emily 27 Sydney
1 Alice 32 London
In the above example, we created a DataFrame with three columns: 'Name', 'Age', and 'City'. To sort the DataFrame based on multiple columns, we passed a list of column names to the .sort_values() function. The first column in the list ('Age') will be the primary sorting key, and the second column ('Name') will be the secondary sorting key.
Additionally, we specified the ascending parameter to control the sort order for each column. By default, it is set to "True", indicating ascending order. In this example, we set ascending=[True, True] to sort the 'Age' column in ascending order and the 'Name' column in ascending order as well.
The resulting DataFrame (sorted_df) is sorted based on the 'Age' column first and then sorted alphabetically based on the 'Name' column (in case of ties in the 'Age' column). This way, we can apply multiple sorting criteria to the DataFrame.
It's important to note that the .sort_values() function creates a new sorted DataFrame rather than modifying the original DataFrame. If you want to modify the original DataFrame, you can use the inplace=True parameter in the .sort_values() function.
By sorting a DataFrame based on multiple columns, we can order the data according to our desired criteria and gain insights from the sorted arrangement. This can be particularly useful when analyzing data with multiple sorting dimensions.
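The sort directions can also differ per column. In this sketch the ages are deliberately tweaked (an assumption, not the article's original data) to create a tie, so the secondary key visibly takes effect:

```python
import pandas as pd

# Ages adjusted so John and Alice tie at 25
df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emily'],
                   'Age': [25, 25, 18, 27],
                   'City': ['New York', 'London', 'Paris', 'Sydney']})

# Age ascending; ties broken by Name in descending order
sorted_df = df.sort_values(['Age', 'Name'], ascending=[True, False])
print(list(sorted_df['Name']))  # ['Bob', 'John', 'Alice', 'Emily']
```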
Handling missing values:
Pandas DataFrame provides various functions to handle missing or NaN (Not a Number) values. We can use the 'fillna()' function to fill missing values with a specific value. For example, let's add a missing value to the 'City' column and fill it with 'Unknown':
df.loc[2, 'City'] = pd.NA
df['City'] = df['City'].fillna('Unknown')
print(df)
Output:
Name Age City
0 John 25 New York
1 Alice 32 London
2 Bob 18 Unknown
3 Emily 27 Sydney
In the above example, we used the 'pd.NA' value to represent the missing value. Then, we used the 'fillna()' function to replace the missing value with 'Unknown'.
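Beyond fillna(), it is often useful to locate missing values first and then decide, per column, how to handle them. A small sketch (the NaN positions here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emily'],
                   'Age': [25, None, 18, 27],
                   'City': ['New York', 'London', None, 'Sydney']})

# Count missing values per column
print(df.isna().sum())

# Drop any row containing a missing value
print(df.dropna())

# Fill per column: the mean age for 'Age', a placeholder for 'City'
filled = df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})
print(filled)
```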
Grouping data:
# Grouping data by a column and calculating mean
grouped_data = df.groupby('City')['Age'].mean()
The groupby() function in pandas is a powerful tool used for grouping rows of a DataFrame based on the values in one or more columns. It's often used in conjunction with aggregation functions to perform calculations on these grouped data. The groupby() operation involves the following steps:
1. Splitting: The data is divided into groups based on one or more specified columns.
2. Applying: An aggregation function (or multiple aggregation functions) is applied to each group individually. These functions calculate summary statistics for each group.
3. Combining: The results of the aggregation functions are combined into a new DataFrame or Series, where each row corresponds to a unique combination of the grouping columns.
Here's the basic syntax of using groupby():
```python
grouped = df.groupby('column_name')
```
Where 'column_name' is the column by which you want to group the data.
Let's illustrate with an example. Assume you have the following DataFrame named sales:
```
Product Category Sales
0 A X 100
1 B Y 150
2 A Y 200
3 C X 120
4 B X 180
5 C Y 220
```
You can use groupby() to calculate the total sales for each product across categories:
import pandas as pd
data = {'Product': ['A', 'B', 'A', 'C', 'B', 'C'],
'Category': ['X', 'Y', 'Y', 'X', 'X', 'Y'],
'Sales': [100, 150, 200, 120, 180, 220]}
sales = pd.DataFrame(data)
print(sales)
grouped = sales.groupby('Product')
total_sales_per_product = grouped['Sales'].sum()
print(total_sales_per_product)
In this example, we're using groupby() to group the data by the 'Product' column. We then apply the sum() function to calculate the total sales for each product. The result is a new Series with the total sales for each product.
You can also group by multiple columns by passing a list of column names to groupby(). Additionally, you can use various aggregation functions like sum(), mean(), max(), min(), and more, to compute summary statistics for each group.
grouped = df.groupby(['column_name_1', 'column_name_2'])
result = grouped['column_to_aggregate'].agg(['sum', 'mean', 'max', 'min'])
The agg() function allows you to specify multiple aggregation functions to be applied to the grouped data simultaneously.
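To make that concrete, here is a runnable sketch applying agg() to the sales DataFrame from above, grouping first by two columns and then by one:

```python
import pandas as pd

sales = pd.DataFrame({'Product': ['A', 'B', 'A', 'C', 'B', 'C'],
                      'Category': ['X', 'Y', 'Y', 'X', 'X', 'Y'],
                      'Sales': [100, 150, 200, 120, 180, 220]})

# Group by two columns, several aggregations at once
by_product_category = sales.groupby(['Product', 'Category'])['Sales'].agg(['sum', 'mean', 'max', 'min'])
print(by_product_category)

# Group by one column
per_category = sales.groupby('Category')['Sales'].agg(['sum', 'mean'])
print(per_category)  # X: sum 400; Y: sum 570, mean 190
```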
Subsetting a DataFrame
To subset a DataFrame in Python, you can use various techniques to extract specific rows or columns based on specific conditions or criteria. Here are a few examples, along with their outputs:
1. Subsetting Rows:
import pandas as pd
# Create a DataFrame
data = {'Name': ['John', 'Emma', 'Adam', 'Emily'],
'Age': [28, 32, 25, 30],
'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)
# Keep only the rows where Age is greater than 28
subset = df[df['Age'] > 28]
print(subset)
Output:
Name Age City
1 Emma 32 London
3 Emily 30 Sydney
In the above example, we use the condition `df['Age'] > 28` to subset the DataFrame `df` and create a new DataFrame `subset` where only the rows with an 'Age' greater than 28 are selected.
You can also filter more based on the City and age as follows:
city_filter = df['City'] == 'London'
age_filter = df['Age'] > 25
print(df[city_filter & age_filter ])
Name Age City
1 Emma 32 London
Let's break down the code:
city_filter = df['City'] == 'London'
age_filter = df['Age'] > 25
print(df[city_filter & age_filter])
So, the final output will be a subset of the original DataFrame df that contains rows where the 'City' column value is 'London' and the 'Age' column value is greater than 25.
For multiple categorical values, you can use .isin() method as follows:
print(df[df['City'].isin(['London', 'Sydney'])])
2. Subsetting Columns:
Subsetting columns from a DataFrame involves selecting specific columns or excluding certain columns from the DataFrame. Here are a few techniques and examples to help you understand subsetting columns in Python:
Before we start and to be aligned on the example, please create the following DataFrame:
import pandas as pd
# Create a DataFrame
data = {'Name': ['John', 'Emma', 'Adam', 'Emily'],
'Age': [28, 32, 25, 30],
'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)
# Subset only the 'Name' column
subset = df['Name']
print(subset)
# Subset only the 'Name' and 'City' columns
subset = df[['Name', 'City']]
print(subset)
Output:
Name City
0 John New York
1 Emma London
2 Adam Paris
3 Emily Sydney
In this example, we subset the DataFrame df and create a new DataFrame subset that consists of only the 'Name' and 'City' columns.
subset = df[['Name', 'City']]
In the line of code: subset = df[['Name', 'City']], two pairs of square brackets are used to subset the DataFrame df and select specific columns.
The outer pair of square brackets, [], is used to slice or subset the DataFrame by columns. Within these brackets, we pass a list of column names ['Name', 'City'] to indicate the specific columns we want to select.
The inner pair of square brackets, [], is used to define a list literal in Python. It encloses the column names as list elements.
By using two pairs of square brackets, we are effectively passing a list of column names as an argument to the DataFrame's indexing operator. This syntax allows us to select multiple columns from the DataFrame and create a new subset DataFrame with only the specified columns.
data = {'Name': ['John', 'Emma', 'Adam', 'Emily'], 'Age': [28, 32, 25, 30], 'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)
subset = df.loc[df['Age'] > 28, ['Name', 'City']]
print(subset)
In this example, we use the loc indexer to subset rows based on the condition df['Age'] > 28 and select the columns ['Name', 'City']. The resulting DataFrame subset includes only the selected rows and columns. These are just a few examples of subsetting a DataFrame in Python; depending on your specific needs, you can combine various conditions and techniques to extract the desired data.
Data Visualization:
Pandas integrates well with other visualization libraries like Matplotlib and Seaborn to create insightful plots and charts.
Plotting a line chart:
df.plot(x='Year', y='Sales', kind='line')
Creating a bar chart:
df.plot(x='Country', y='Population', kind='bar')
Loading Data:
Pandas allows you to load data from various sources such as CSV, Excel, SQL databases, and more.
Loading CSV data:
data = pd.read_csv('data.csv')
Let's load one of the sample files that ship with Google Colab:
import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv("california_housing_train.csv")
# Display the first few rows of the DataFrame to verify the data load
df.head()
Loading Excel data:
data = pd.read_excel('data.xlsx')
Loading JSON data:
Pandas provides a function called pd.read_json() that allows you to read JSON data and convert it into a Series or DataFrame.
Here's an example of creating a Pandas Series from a JSON file:
Suppose we have a JSON file called data.json with the following content:
{
"A": 10,
"B": 20,
"C": 30,
"D": 40,
"E": 50
}
To create a Pandas Series from this JSON file, we can use the pd.read_json() function as follows:
import pandas as pd
# Read JSON file and create a Series
series_from_json = pd.read_json('data.json', typ='series')
print(series_from_json)
Output:
A 10
B 20
C 30
D 40
E 50
dtype: int64
In the example above, we import the pandas library as pd. We then use the pd.read_json() function and pass the file path of the JSON file ('data.json') as the first argument. Additionally, we specify the typ parameter as 'series' to indicate that we want to create a Series from the JSON data.
The resulting Series series_from_json will have the keys from the JSON file as the index labels and the corresponding values as the values of the Series.
Creating a Pandas Series from a JSON file comes in handy when you have JSON data that you want to analyze and manipulate using the rich functionalities of Pandas. It allows you to easily read and transform JSON data into a structured format for further data analysis tasks.
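If the JSON lives in a string rather than a file, you can wrap it in io.StringIO (recent pandas versions in fact deprecate passing literal JSON strings directly to read_json). A self-contained sketch that stands in for the data.json example:

```python
import pandas as pd
from io import StringIO

# Same content as the data.json example above
json_data = '{"A": 10, "B": 20, "C": 30, "D": 40, "E": 50}'

# StringIO makes the string behave like a file object
series_from_json = pd.read_json(StringIO(json_data), typ='series')
print(series_from_json)
```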
Exporting Data:
Pandas allows you to export data to various formats, including CSV and Excel.
Exporting data to CSV:
df.to_csv('output.csv', index=False)
Exporting data to Excel:
df.to_excel('output.xlsx', index=False)
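A quick way to convince yourself that the export works is a round trip: write a DataFrame out and read it back. A minimal sketch using a temporary directory:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice'], 'Age': [25, 32]})

# Write to a temporary CSV file, omitting the row index
path = os.path.join(tempfile.mkdtemp(), 'output.csv')
df.to_csv(path, index=False)

# Read it back; the data survives the round trip
restored = pd.read_csv(path)
print(restored.equals(df))  # True
```

Without index=False, the row labels would be written as an extra unnamed column and the round trip would no longer match.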
Working with Large Data
Since we learned how to load and export data from files, let's explore data manipulation in detail using the file california_housing_train.csv that we loaded previously.
import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv("california_housing_train.csv")
Exploring a DataFrame: .head()
The .head() function is a useful method in Pandas DataFrame that allows us to quickly examine the first few rows of our data. It helps us get a glimpse of the structure and contents of the DataFrame.
The .head() function returns the top n rows of the DataFrame, where n is an optional parameter. By default, it displays the first 5 rows. Let's explore a few examples to understand it better:
By default, .head() displays the first 5 rows of the DataFrame. It provides a quick summary of the data, including the column names and a few rows.
We can also pass an integer value as a parameter to the .head() function to specify the number of rows we want to display. For example, to display the first 3 rows:
print(df.head(3))
In this case, the .head(3) function displays the first 3 rows of the DataFrame. It allows us to control the number of rows we want to see.
The .head() function is particularly helpful when dealing with large datasets, as it allows us to check the data quickly without having to display the entire DataFrame. It provides a concise overview before diving into further analysis or manipulation.
In addition to the .head() function, Pandas also offers the .tail() function, which works in a similar way but displays the last n rows of the DataFrame. By default, it shows the last 5 rows. This function can be useful when we want to check the last few records in our data.
Overall, the .head() function is a handy method for exploring the initial rows of a Pandas DataFrame, providing a snapshot of the data's structure and contents. It helps us get a quick understanding of the DataFrame before delving into further analysis or processing.
Note: The .head() function cannot limit the displayed columns; it only controls the number of rows returned from the DataFrame (5 by default). When a DataFrame is too wide for the display, pandas wraps the output into two sections, but all columns are still shown. If you want to limit the displayed columns, select specific columns from the DataFrame before using .head().
Alternatively, if you want to limit the displayed columns after using .head(), you can use column indexing or the .loc[] indexer to select specific columns.
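Both approaches can be sketched like this (using a small sample DataFrame in place of the CSV):

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emma', 'Adam', 'Emily'],
                   'Age': [28, 32, 25, 30],
                   'City': ['New York', 'London', 'Paris', 'Sydney']})

# Select the columns first, then take the first rows
print(df[['Name', 'Age']].head(3))

# Or take the rows first, then select columns with .loc
print(df.head(3).loc[:, ['Name', 'Age']])
```

Both produce a 3-row, 2-column view of the data.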
Exploring a DataFrame with .info()
The .info() function is a useful method in Pandas DataFrame that provides a concise summary of the DataFrame's structure, including information about the columns, data types, and memory usage. It gives us a quick overview of the DataFrame's metadata and can be particularly helpful when dealing with large datasets. Let's explore how to use it:
import pandas as pd
# Creating a sample DataFrame
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
'Age': [25, 32, 18, 27],
'City': ['New York', 'London', 'Paris', 'Sydney']}
df = pd.DataFrame(data)
# Using .info() to get DataFrame summary
print(df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Name 4 non-null object
1 Age 4 non-null int64
2 City 4 non-null object
dtypes: int64(1), object(2)
memory usage: 224.0+ bytes
None
In the above example, we created a DataFrame with three columns: 'Name', 'Age', and 'City'. By applying the .info() function to the DataFrame, we get a structured summary of the DataFrame's metadata:
The .info() function gives us valuable information about the structure and composition of the DataFrame. It helps us identify potential data issues, such as missing values or incorrect data types. Additionally, it allows us to assess the memory usage of the DataFrame and make optimizations if needed.
Overall, the .info() function is a handy tool for quickly examining the metadata of a Pandas DataFrame, providing insights about column names, data types, missing values, and memory usage. It is an essential step in understanding and preprocessing the data before performing further analysis or manipulation.
Here's another example of how we can read the CSV file named "california_housing_train.csv" and explore the DataFrame using .info():
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv("california_housing_train.csv")
# Use .info() to get DataFrame summary
print(df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17000 entries, 0 to 16999
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 longitude 17000 non-null float64
1 latitude 17000 non-null float64
2 housing_median_age 17000 non-null float64
3 total_rooms 17000 non-null float64
4 total_bedrooms 17000 non-null float64
5 population 17000 non-null float64
6 households 17000 non-null float64
7 median_income 17000 non-null float64
8 median_house_value 17000 non-null float64
dtypes: float64(9)
memory usage: 1.2 MB
None
In this example, we read the "california_housing_train.csv" file using the pd.read_csv() function and stored it in the DataFrame df. Then, we applied the .info() function to get the summary shown above.
Exploring Columns with df.columns
In Pandas, df.columns is an attribute of a DataFrame df that allows you to access the column labels (column names) of the DataFrame. It returns an Index object containing the names of all the columns in the DataFrame. You can think of it as a list of column names.
Here's how you can use df.columns:
import pandas as pd
# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 35],
'City': ['New York', 'Los Angeles', 'Chicago']}
df = pd.DataFrame(data)
# Access the column names
column_names = df.columns
# Print the column names
print(column_names)
Output:
Index(['Name', 'Age', 'City'], dtype='object')
In the example above, df.columns returns an Index object containing the column names 'Name', 'Age', and 'City'. You can use these column names for various DataFrame operations, such as selecting specific columns, renaming columns, or performing calculations on specific columns.
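For instance, renaming can be done either by assigning a whole new list to df.columns or selectively with .rename(). A short sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Replace all column names at once (the list length must match)
df.columns = ['full_name', 'age']
print(list(df.columns))  # ['full_name', 'age']

# Rename selectively with a mapping
df = df.rename(columns={'age': 'age_years'})
print(list(df.columns))  # ['full_name', 'age_years']
```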
Exploring Data with df.describe()
The df.describe() method in Pandas is used to generate descriptive statistics of a DataFrame. It provides summary statistics for the numerical columns in the DataFrame, such as count, mean, standard deviation, minimum, and maximum values. This method is particularly useful for getting a quick overview of your data's central tendencies and dispersion.
Here's how to use df.describe():
import pandas as pd
# Create a DataFrame
data = {'Age': [25, 30, 35, 40, 45],
'Salary': [50000, 60000, 75000, 90000, 80000]}
df = pd.DataFrame(data)
# Use df.describe()
summary = df.describe()
# Print the summary statistics
print(summary)
Output:
Age Salary
count 5.000000 5.000000
mean 35.000000 71000.000000
std 7.905694 15968.719423
min 25.000000 50000.000000
25% 30.000000 60000.000000
50% 35.000000 75000.000000
75% 40.000000 80000.000000
max 45.000000 90000.000000
The output includes the following statistics for each numerical column: count, mean, standard deviation (std), minimum, the 25th, 50th (median), and 75th percentiles, and maximum.
These statistics help you understand the central tendency, spread, and distribution of the data. For example, in the output above, you can see that the 'Age' column has a mean of 35 years, a standard deviation of approximately 7.91, and ranges from 25 to 45. Similarly, the 'Salary' column has a mean of $71,000, a standard deviation of approximately $15,969, and ranges from $50,000 to $90,000.
Here are the results for the file:
"/content/sample_data/california_housing_train.csv"
import pandas as pd
# Load the CSV file into a DataFrame
df = pd.read_csv("/content/sample_data/california_housing_train.csv")
df.describe()
df.transpose() (swap rows and columns)
We can also use it to transpose the .describe() output from the previous example.
The df.transpose() method in Pandas is used to transpose (swap rows and columns) a DataFrame. This operation flips the DataFrame along its diagonal axis. It can be particularly useful when you want to interchange rows and columns for various reasons, such as changing the orientation of your data for better analysis or visualization.
Here's how to use df.transpose():
import pandas as pd
# Create a DataFrame
data = {'A': [1, 2, 3],
'B': [4, 5, 6],
'C': [7, 8, 9]}
df = pd.DataFrame(data)
# Transpose the DataFrame
transposed_df = df.transpose()
# Print the transposed DataFrame
print(transposed_df)
Original DataFrame:
```
A B C
0 1 4 7
1 2 5 8
2 3 6 9
```
Transposed DataFrame:
```
0 1 2
A 1 2 3
B 4 5 6
C 7 8 9
```
As you can see, the rows of the original DataFrame become columns in the transposed DataFrame, and vice versa. The column labels in the original DataFrame become the index labels in the transposed DataFrame.
You might use df.transpose() when you want to perform operations or analysis on the data that are more naturally expressed with the rows and columns interchanged. For instance, it can be helpful when the original DataFrame has features (columns) as rows, and you want to analyze them as columns.
Keep in mind that df.transpose() returns a new DataFrame with the transposed data, and it doesn't modify the original DataFrame in place. If you want to update the original DataFrame, you can assign the result back to it:
```python
df = df.transpose() # Assign the transposed DataFrame back to df
```
Additionally, you can use the .T attribute as a shorthand for transposing a DataFrame:
```python
transposed_df = df.T  # Equivalent to df.transpose()
```
Dropping a row from a DataFrame
To drop a row by index from a DataFrame in pandas, you can use the drop() function with the index parameter set to the desired row index. Here's an example:
import pandas as pd
# Assuming you have a DataFrame called 'df'
# Drop a row by index
df = df.drop(index=3)
# Print the updated DataFrame
print(df)
In the code above, df.drop(index=3) drops the row with index 3 from the DataFrame df. The drop() function is called on the DataFrame object and the index parameter is set to the index value you want to remove. The resulting DataFrame is then assigned back to df. Finally, print(df) is used to display the updated DataFrame after dropping the row.
Make sure to replace 'df' with the actual name of your DataFrame. Additionally, you can remove multiple rows by passing a list of index values to the index parameter, like index=[3, 5, 7], to drop multiple rows at once.
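The same drop() call also covers condition-based removal: filter for the unwanted rows and pass their index labels. A small self-contained sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emily'],
                   'Age': [25, 32, 18, 27]})

# Drop several rows by label
smaller = df.drop(index=[1, 3])
print(list(smaller['Name']))  # ['John', 'Bob']

# Drop all rows matching a condition via their index labels
no_minors = df.drop(index=df[df['Age'] < 21].index)
print(list(no_minors['Name']))  # ['John', 'Alice', 'Emily']
```

For a simple condition, boolean indexing such as df[df['Age'] >= 21] is usually the more direct choice.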
Dropping columns from a DataFrame
To drop columns from a DataFrame in Pandas, you can use the .drop() method. Here is an example using the provided DataFrame:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv("/content/sample_data/california_housing_train.csv")
# Drop columns using the .drop() method
df.drop(['column_name_1', 'column_name_2'], axis=1, inplace=True)
# The 'column_name_1' and 'column_name_2' columns have been dropped from the DataFrame
In the example above, replace 'column_name_1' and 'column_name_2' with the actual names of the columns you want to drop from the DataFrame. The axis=1 argument specifies that the columns should be dropped, and inplace=True ensures that the changes are made directly in the DataFrame.
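Here is the same idea as a self-contained sketch on a small DataFrame, which also shows that without inplace=True the original stays untouched:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice'],
                   'Age': [25, 32],
                   'City': ['New York', 'London']})

# Returns a new DataFrame without the 'City' column
without_city = df.drop(['City'], axis=1)
print(list(without_city.columns))  # ['Name', 'Age']

# The original DataFrame still has all three columns
print(list(df.columns))            # ['Name', 'Age', 'City']
```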
Let's drop some columns from the df we have loaded from california_housing_train.csv:
import pandas as pd
# Read the CSV file into a DataFrame
df = pd.read_csv("/content/sample_data/california_housing_train.csv")
print(df.head())
Let's drop the following columns:
# Drop Columns
df.drop(['median_house_value',
'median_income', 'households', 'population',
'housing_median_age'], axis = 1, inplace=True)
#print the result
print(df.head())
Resetting the index
In a Pandas DataFrame, the index refers to the labels assigned to each row. It provides a unique identifier for each row of data, which helps in data alignment and indexing operations.
Resetting the index means reassigning the default numerical index (0, 1, 2, ...) to the DataFrame. This can be useful in scenarios where you have modified the DataFrame and want to discard the current index, or when you want to convert an indexed DataFrame into a simple numerical index.
The most common method is .reset_index(), which moves the current index into a regular column (or discards it entirely with drop=True) and installs a fresh numerical index.
Note: In the examples below, inplace=True is used to modify the DataFrame directly. If you prefer to create a new DataFrame instead of modifying the original one, you can omit inplace=True.
Let's drop the row at index 2 and see what happens to the data
df.drop(index=2, inplace=True)
print(df.head())
Notice that index 2 is gone above. Now, let's reset the index:
df.reset_index(inplace=True)
print(df.head())
In this case, reset_index() kept the old index as a new column named 'index'. Let's drop index 3 and reset with the drop option. We will also drop the old 'index' column to avoid confusion:
df.drop('index', axis=1, inplace=True)
df.drop(index=3,inplace=True)
print(df.head())
df.reset_index(drop=True, inplace=True)
print(df.head())
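The same drop-and-reset sequence can be reproduced on a small in-memory DataFrame, which makes the index bookkeeping easy to see:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Alice', 'Bob', 'Emily'],
                   'Age': [25, 32, 18, 27]})

# Dropping a row leaves a gap in the index
df = df.drop(index=2)
print(list(df.index))  # [0, 1, 3]

# drop=True renumbers from 0 and discards the old index
df = df.reset_index(drop=True)
print(list(df.index))  # [0, 1, 2]
```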
Data Aggregation
Aggregation refers to the process of combining multiple values into a single value, typically by applying a mathematical or statistical function. In the context of pandas DataFrames, aggregation involves grouping data based on certain criteria and performing calculations on those grouped subsets.
Aggregation is a powerful technique for summarizing and analyzing data. It lets you condense large datasets into meaningful insights by calculating summary statistics, identifying patterns, or deriving new features. pandas provides a simple and efficient way to perform aggregation operations on DataFrames, making it an essential tool for data analysis and manipulation.
Example:
Using the following data, calculate total revenue, and average price of items sold in each store:
data = {
'Store': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Item': ['Mango', 'Banana', 'Orange', 'Grape', 'Mango', 'Banana', 'Orange', 'Grape'],
'Price': [60, 30, 20, 70, 45, 11, 60, 30],
'Quantity': [20, 10, 40, 30, 50, 20, 40, 45]
}
df = pd.DataFrame(data)
print(df)
Let's calculate the total revenue and average price of items sold in each store using the provided data:
The output will be:
Store Revenue Average_Price
0 A 6650 46.25
1 B 3970 35.25
This code calculates the total revenue and average price of items sold in each store using pandas DataFrame and aggregation functions. By grouping the data by the 'Store' column and applying the sum() and mean() aggregation functions, we obtain the desired results.
import pandas as pd
data = {
'Store': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B'],
'Item': ['Mango', 'Banana', 'Orange', 'Grape', 'Mango', 'Banana', 'Orange', 'Grape'],
'Price': [60, 30, 20, 70, 45, 11, 60, 30],
'Quantity': [20, 10, 40, 30, 50, 20, 40, 45]
}
df = pd.DataFrame(data)
df['Revenue'] = df['Price'] * df['Quantity']
store_summary = df.groupby('Store').agg({'Revenue': 'sum', 'Price': 'mean'}).reset_index()
store_summary.rename(columns={'Price': 'Average_Price'}, inplace=True)
print("Store Summary:\n", store_summary)
This code demonstrates how to perform aggregation and data processing on a DataFrame using the pandas library. Step by step: a 'Revenue' column is computed as Price multiplied by Quantity; the rows are grouped by 'Store'; agg() sums the revenue and averages the price within each group; reset_index() turns the group labels back into a regular column; and rename() gives the averaged price a clearer column name.
Additional examples:
store_item_summary = df.groupby(['Store', 'Item']).agg({'Revenue': 'sum', 'Price': 'mean'}).reset_index()
store_item_summary.rename(columns={'Price': 'Average_Price'}, inplace=True)
print("Store-Item Summary:\n", store_item_summary)
store_stats = df.groupby('Store').agg({'Revenue': 'sum', 'Price': ['min', 'max'], 'Quantity': 'max'}).reset_index()
store_stats.columns = ['Store', 'Total_Revenue', 'Min_Price', 'Max_Price', 'Max_Quantity']
print("Store Statistics:\n", store_stats)
These examples demonstrate how to aggregate and process data in a DataFrame using pandas. Aggregation functions like sum, mean, min, max, etc., can be used based on the requirements to derive meaningful insights from the data.
Example 2:
Using the following data, calculate the total amount spent by each customer:
data = {
'Customer': ['Rany', 'Reem', 'Rany', 'Rany', 'Reem', 'Reem', 'Rany', 'Reem'],
'Item': ['iPad', 'iPhone', 'Monitor', 'Mouse', 'iPad', 'iPhone', 'Monitor', 'Mouse'],
'Price': [10, 5, 50, 20, 10, 5, 50, 20],
'Quantity': [3, 4, 2, 5, 10, 6, 1, 2]
}
df = pd.DataFrame(data)
df['total_spent'] = df['Price'] * df['Quantity']
print(df)
You can also use the agg() function as follows:
customer_summary = df.groupby('Customer').agg({'total_spent': 'sum'}).reset_index()
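Putting the whole example together as a runnable sketch (Rany's purchases total 10·3 + 50·2 + 20·5 + 50·1 = 280, Reem's 5·4 + 10·10 + 5·6 + 20·2 = 190):

```python
import pandas as pd

data = {
    'Customer': ['Rany', 'Reem', 'Rany', 'Rany', 'Reem', 'Reem', 'Rany', 'Reem'],
    'Item': ['iPad', 'iPhone', 'Monitor', 'Mouse', 'iPad', 'iPhone', 'Monitor', 'Mouse'],
    'Price': [10, 5, 50, 20, 10, 5, 50, 20],
    'Quantity': [3, 4, 2, 5, 10, 6, 1, 2]
}
df = pd.DataFrame(data)
df['total_spent'] = df['Price'] * df['Quantity']

# Sum the spending per customer
customer_summary = df.groupby('Customer').agg({'total_spent': 'sum'}).reset_index()
print(customer_summary)  # Rany: 280, Reem: 190
```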
Conclusion:
In conclusion, Python Pandas DataFrame is a powerful tool for managing and analyzing data in a tabular form. It allows easy access, manipulation, and analysis of data, providing numerous functions and methods. With its rich functionality and ease of use, Pandas DataFrame is widely used in data science and analysis tasks.
Pandas Dataframe Cheat sheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf