Pandas provides high-performance, in-memory join operations similar to those in SQL databases. These operations allow you to merge multiple DataFrame objects based on common keys or indexes efficiently.
The DataFrame.merge() method in Pandas enables merging of DataFrame or named Series objects using database-style joins. A named Series is treated as a DataFrame with a single named column. Joins can be performed on columns or indexes.
If merging on columns, DataFrame indexes are ignored. If merging on indexes or indexes with columns, then the index remains the same. However, in cross merges (how='cross'), you cannot specify column names for merging.
Below is the syntax of this method −
The key parameters are −
Let’s create two DataFrames and perform merge operations on them.
import pandas as pd
# Creating the first DataFrame
left = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Alex', 'Amy', 'Allen', 'Alice', 'Ayoung'],
'subject_id': ['sub1', 'sub2', 'sub4', 'sub6', 'sub5']
})
# Creating the second DataFrame
right = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'Name': ['Billy', 'Brian', 'Bran', 'Bryce', 'Betty'],
'subject_id': ['sub2', 'sub4', 'sub3', 'sub6', 'sub5']
})
print("Left DataFrame:")
print(left)
print("\nRight DataFrame:")
print(right)
Left DataFrame: id Name subject_id 0 1 Alex sub1 1 2 Amy sub2 2 3 Allen sub4 3 4 Alice sub6 4 5 Ayoung sub5 Right DataFrame: id Name subject_id 0 1 Billy sub2 1 2 Brian sub4 2 3 Bran sub3 3 4 Bryce sub6 4 5 Betty sub5
result = left.merge(right, on='id') print(result)
id Name_x subject_id_x Name_y subject_id_y 0 1 Alex sub1 Billy sub2 1 2 Amy sub2 Brian sub4 2 3 Allen sub4 Bran sub3 3 4 Alice sub6 Bryce sub6 4 5 Ayoung sub5 Betty sub5
result = left.merge(right, on=['id', 'subject_id']) print(result)
id Name_x subject_id Name_y 0 4 Alice sub6 Bryce 1 5 Ayoung sub5 Betty
The how argument determines which keys to include in the resulting DataFrame. If a key combination does not appear in either the left or right DataFrame, the values in the joined table will be NaN.
| Merge Method | SQL Equivalent | Description |
|---|---|---|
| left | LEFT OUTER JOIN | Use keys from left object |
| right | RIGHT OUTER JOIN | Use keys from right object |
| outer | FULL OUTER JOIN | Union of keys from both DataFrames |
| inner | INNER JOIN | Intersection of keys from both DataFrames |
print(left.merge(right, on='subject_id', how='left'))
id_x Name_x subject_id id_y Name_y 0 1 Alex sub1 NaN NaN 1 2 Amy sub2 1.0 Billy 2 3 Allen sub4 2.0 Brian 3 4 Alice sub6 4.0 Bryce 4 5 Ayoung sub5 5.0 Betty
print(left.merge(right, on='subject_id', how='right'))
id_x Name_x subject_id id_y Name_y 0 2.0 Amy sub2 1 Billy 1 3.0 Allen sub4 2 Brian 2 NaN NaN sub3 3 Bran 3 4.0 Alice sub6 4 Bryce 4 5.0 Ayoung sub5 5 Betty
print(left.merge(right, how='outer', on='subject_id'))
id_x Name_x subject_id id_y Name_y 0 1.0 Alex sub1 NaN NaN 1 2.0 Amy sub2 1.0 Billy 2 3.0 Allen sub4 2.0 Brian 3 4.0 Alice sub6 4.0 Bryce 4 5.0 Ayoung sub5 5.0 Betty 5 NaN NaN sub3 3.0 Bran
Joining will be performed on index. Join operation honors the object on which it is called. So, a.join(b) is not equal to b.join(a).
print(left.merge(right, on='subject_id', how='inner'))
id_x Name_x subject_id id_y Name_y 0 2 Amy sub2 1 Billy 1 3 Allen sub4 2 Brian 2 4 Alice sub6 4 Bryce 3 5 Ayoung sub5 5 Betty
Pandas also provides a DataFrame.join() method, which is useful for merging DataFrames based on their index. It works similarly to DataFrame.merge() but is more efficient for index-based operations.
Below is the syntax of this method −
result = left.join(right, lsuffix='_left', rsuffix='_right') print(result)
id_left Name_left subject_id_left id_right Name_right subject_id_right 0 1 Alex sub1 1 Billy sub2 1 2 Amy sub2 2 Brian sub4 2 3 Allen sub4 3 Bran sub3 3 4 Alice sub6 4 Bryce sub6 4 5 Ayoung sub5 5 Betty sub5
