A Guide to Insert, Update, and Delete Rows in Python Pandas Data Analytics
In the ever-evolving realm of data science, the ability to manipulate and mold data is a fundamental skill. Whether you’re an aspiring data scientist or a seasoned analyst, proficiency in tools like Pandas can make all the difference. Among its many capabilities, Pandas excels at handling tabular data, providing a robust foundation for data manipulation in Python.
This blog aims to delve into the intricacies of efficient data manipulation in Pandas, with a specific focus on inserting, updating, and deleting rows. These operations are the building blocks of data transformation, and mastering them opens the door to a world of possibilities in data analysis and exploration.
So, fasten your seatbelts as we embark on a journey to unravel the secrets of Pandas, exploring techniques that will empower you to wield this powerful tool with finesse. Whether you’re looking to add new insights to your data, update existing records, or clean up unnecessary information, this guide has you covered. Let’s dive into the heart of Pandas and unlock the potential of data manipulation in Python.
Table of Contents
Toggleimport and read in Pandas
import pandas as pd df=pd.read_excel('E:\JOINS\CUSTOMERDATA.xlsx') print(df)
- import pandas as pd
This line imports the Pandas library and aliases it as pd for brevity. The alias is a common convention in the data science community. - Reading Excel File:
df = pd.read_excel(‘E:\JOINS\CUSTOMERDATA.xlsx’)
Here, you use the pd.read_excel() function to read the contents of an Excel file into a Pandas DataFrame (df). The file path ‘E:\JOINS\CUSTOMERDATA.xlsx’ points to the location of your Excel file. - If your Excel file has multiple sheets, you can specify the sheet name using the sheet_name parameter (e.g., pd.read_excel(‘file.xlsx’, sheet_name=’Sheet1′)).
- Displaying the DataFrame:
- print(df)
This line prints the entire DataFrame to the console. This is useful for quickly inspecting the structure and contents of your data.
Note: If your DataFrame is too large, printing it in its entirety may not be practical. In such cases, you can use df.head() to display the first few rows or specify the number of rows to display.
In summary, your code reads an Excel file into a Pandas DataFrame and then prints the DataFrame to the console, allowing you to inspect the data. If you have any specific questions about the DataFrame or if you’d like to perform certain operations on it, feel free to let me know!
Basic Operator in Pandas
# Displaying the current DataFrame print(df) # Selecting Columns names = df['customer_name'] print("\nSelected 'customer_name' column:\n", names) # Filtering Data young_customers = df[df['customer_id'] < 104] print("\nCustomers with 'customer_id' less than 104:\n", young_customers) # Sorting Data sorted_df = df.sort_values(by='customer_name', ascending=False) print("\nDataFrame sorted by 'customer_name' in descending order:\n", sorted_df)
1.Displaying the DataFrame:
print(df) This prints the entire DataFrame, showing all the rows and columns.2.Selecting Columns:
names = df[‘customer_name’] print(“\nSelected ‘customer_name’ column:\n”, names) It extracts the ‘customer_name’ column from the DataFrame and prints it. names is now a Series containing only the ‘customer_name’ column.3.Filtering Data:
young_customers = df[df[‘customer_id’] < 104] print(“\nCustomers with ‘customer_id’ less than 104:\n”, young_customers) It filters the DataFrame to include only the rows where the ‘customer_id’ is less than 104.4. Sorting Data:
sorted_df = df.sort_values(by=’customer_name’, ascending=False) print(“\nDataFrame sorted by ‘customer_name’ in descending order:\n”, sorted_df) It sorts the DataFrame based on the ‘customer_name’ column in descending order and then prints the sorted DataFrame. Seems like you’re manipulating and exploring your data. Anything specific you’re trying to achieve with this DataFrame? OUTPUTcustomer_id customer_name customer_email 0 101 Alice alice@example.com 1 102 Bob bob@example.com 2 103 Charlie charlie@example.com 3 104 David david@example.com 4 105 Eve eve@example.com 5 106 Friday Fri@fmail.com Selected 'customer_name' column: 0 Alice 1 Bob 2 Charlie 3 David 4 Eve 5 Friday Name: customer_name, dtype: object Customers with 'customer_id' less than 104: customer_id customer_name customer_email 0 101 Alice alice@example.com 1 102 Bob bob@example.com 2 103 Charlie charlie@example.com DataFrame sorted by 'customer_name' in descending order: customer_id customer_name customer_email 5 106 Friday Fri@fmail.com 4 105 Eve eve@example.com 3 104 David david@example.com 2 103 Charlie charlie@example.com 1 102 Bob bob@example.com 0 101 Alice alice@example.com
Adding a new row using loc with label-based indexing
new_data_loc = {'customer_id': 110.0, 'customer_name': 'Grace', 'customer_email': 'grace@example.com'} df.loc[len(df)] = new_data_loc
new_data_loc
is a dictionary that contains information about a new customer. It has keys such as ‘customer_id’, ‘customer_name’, and ‘customer_email’, along with their respective values.
df.loc[len(df)]
is selecting a specific location within the DataFrame df. len(df) returns the length of the DataFrame, which corresponds to the index where the new row will be added.
df.loc[len(df)] =
new_data_loc assigns the values in the new_data_loc dictionary to the row in the DataFrame at the index position obtained by len(df). This effectively appends a new row to the DataFrame df with the information provided in the new_data_loc dictionary.
So, the code snippet is taking the dictionary new_data_loc and using it to add a new row to the DataFrame df with the information of a customer named Grace.
Adding a new row using iloc with integer-based indexing
new_data_iloc = {'customer_id': 111.0, 'customer_name': 'Harry', 'customer_email': 'harry@example.com'} df.loc[df.index.max() + 1] = new_data_iloc # Displaying the updated DataFrame print(df)
This code is similar to the previous one but uses df.index.max() + 1 to find the next available index for the new row. Here’s a breakdown:
new_data_iloc
is a dictionary containing details about a new customer with keys ‘customer_id’, ‘customer_name’, and ‘customer_email’, each with their respective values.
df.loc[df.index.max() + 1]
locates the row in the DataFrame df at the next available index. df.index.max() returns the maximum index value in the DataFrame, and adding 1 ensures that the new row is added to the next index.
df.loc[df.index.max() + 1]
= new_data_iloc assigns the values from the new_data_iloc dictionary to the row in the DataFrame df at the index position determined by df.index.max() + 1. This effectively appends a new row to the DataFrame df with the information provided in the new_data_iloc dictionary.
The print(df) statement displays the updated DataFrame with the new row added, showing the customer named Harry with the given details.
This code is adding a new row to the DataFrame df with the information of a customer named Harry. The printed output will show the updated DataFrame with this new addition.
Delete row in data frame
df=df[df['customer_id']!=106] print(df)
This line of code performs a filtering operation on the DataFrame df using a condition to remove rows where the ‘customer_id’ column is equal to 106. Let’s break it down:
df[‘customer_id’] != 106
creates a boolean Series where each row is evaluated to True or False based on whether the ‘customer_id’ for that row is not equal to 106.
df[df[‘customer_id’] != 106]
filters the DataFrame df by selecting only those rows where the condition df[‘customer_id’] != 106 is True. This effectively removes the rows where the ‘customer_id’ is equal to 106 from the DataFrame.
The updated DataFrame is reassigned to the variable df. Now, df contains all the rows where the ‘customer_id’ is not equal to 106.
The result displayed by print(df) will show the DataFrame after removing all rows where the ‘customer_id’ is equal to 106. It filters out those specific rows, leaving only the rows where the ‘customer_id’ is different from 106.
Update row in data frame
df.loc[df['customer_id']==104,'customer_name']='Ramesh' print(df)
This line of code modifies the ‘customer_name’ in the DataFrame df for the rows where the ‘customer_id’
is equal to 104, setting their ‘customer_name’ to ‘Ramesh’. Let’s break it down:
df[‘customer_id’] == 104
creates a boolean Series that checks each row in the DataFrame to identify where the ‘customer_id’ column is equal to 104.
df.loc[df[‘customer_id’] == 104, ‘customer_name’] = ‘Ramesh’ uses the .loc function to locate rows meeting the condition df[‘customer_id’] == 104. It specifically targets the ‘customer_name’ column and sets the value to ‘Ramesh’ for those rows.
So, this code line updates the ‘customer_name’ to ‘Ramesh’ for all rows in the DataFrame df where the ‘customer_id’ is equal to 104. This operation effectively changes the ‘customer_name’ selectively for rows that meet the specified condition. The printed output via print(df) will display the DataFrame after this modification.
Inserting a New Column
ages = [25, 30, 28, 33, 27,22,21] # Inserting 'age' column into the DataFrame df['age'] = ages
To insert a new column ‘age’ in the DataFrame:
Assuming ‘age’ data for each customer
ages = [25, 30, 28, 33, 27]
Inserting ‘age’ column into the DataFrame
df[‘age’] = ages
The code uses this line to add the ‘age’ column to the DataFrame
df
and populates it with the values from theages
list.This operation aligns the ages in the list with the rows of the DataFrame based on their indices. The length of the list should match the number of rows in the DataFrame.
Each age value in the list corresponds to a row in the DataFrame, with the first age (25) being assigned to the first row, the second age (30) to the second row, and so on.
By executing this code, the ‘age’ column will be added to your DataFrame, providing age information aligned with the existing rows of your data.
Deleting a Column:
Deleting columns in a DataFrame is a common operation when you’re working with data in pandas, a popular data manipulation library in Python. You can delete one or more columns using various methods. Here’s how you can do it:
1. Using drop Method
The drop method is versatile and can be used to delete both rows and columns by specifying the axis parameter.
Delete a single column:
import pandas as pd
# Sample DataFrame
df = pd.DataFrame({
‘A’: [1, 2, 3],
‘B’: [4, 5, 6],
‘C’: [7, 8, 9]
})
# Delete column ‘B’
df = df.drop(‘B’, axis=1)
print(df)
2. Using del Keyword
The del keyword is a Python statement that removes an item by its name. It’s straightforward for removing a single column but cannot directly handle multiple columns without a loop or additional logic.
# Delete column 'B' del df['B'] print(df)
Using pop Method
The pop method removes a column and returns it as a series. This can be useful if you want to use the removed column for something else.
# Remove and return column 'B' removed_column = df.pop('B') print(df) print(removed_column)
Selecting Columns to Keep
Instead of deleting unwanted columns, you can select the columns you want to keep. This is not a direct deletion method but effectively achieves the same result.
# Keep only columns 'A' and 'C'
df = df[['A', 'C']]
print(df)