Pandas merge() – A Simple Illustrated Guide with Video – Finxter

0
46
Pandas merge() – A Simple Illustrated Guide with Video – Finxter


In this tutorial, we will learn about the Pandas merge() function. Described in one sentence, the merge() function is used to combine datasets in various ways.

As you go through the tutorial, you can watch the following video guide for ease of understanding:

Syntax and Parameters

pandas.merge(left, right, how='inner', on=None, left_on=None, 
             right_on=None, left_index=False, right_index=False, 
             sort=False, suffixes=('_x', '_y'), copy=True, 
             indicator=False, validate=None)

Here are the parameters from the official documentation:

Parameter Type Description
left DataFrame
right DataFrame or Series Data frame to merge with
how {'left', 'right', 'outer',
'inner', 'cross'}
, default 'inner'
Merge type to perform:
left: only use the keys from the left DataFrame
right: only use the keys from the right DataFrame
outer: use common keys from both DataFrames
inner: use overlap of keys from both DataFrames
cross: cartesian product from both DataFrames
on label or list Column or index level names to join on.
Must be contained in both DataFrames.
left_on label, or list, or
array-like
Column or index level names to join on in the left DataFrame.
right_on label, or list, or
array-like
Column or index level names to join on in the right DataFrame.
left_index bool, default False Use index from the left DataFrame as join key(s).
right_index bool, default False Use index from the right DataFrame as join key(s).
sort bool, default False Sort the join keys lexicographically in the resulting DataFrame. If set to False, the order of the join keys depends on the join type.
suffixes list-like, default is
("_x", "_y")
A length-2 sequence where each element is
optionally a string indicating the suffix to add to overlapping column names in left and right respectively. At least one of the values must not be None.
copy bool, default True If False, avoid a copy if possible.
indicator bool or str, default
False
If set to True, adds a column to the output DataFrame called "_merge" containing information of the source of each row. The column can be given a different name by providing a string as argument.
validate str, optional If used, checks if merge is of a specified type.
“one_to_one” or “1:1”: checks if merged keys are unique in both left and right datasets.
“one_to_many” or “1:m”: checks if merged keys are unique in left dataset.
“many_to_one” or “m:1”: checks if merged keys are unique in right dataset.
“many_to_many” or “m:m”: allowed, but it does not result in checks.

The return value of the merge() function is a DataFrame consisting of the two merged objects.

Basic Example

To get started, we will first create two data frames that we will be merging in several ways throughout this tutorial:

import pandas as pd

df1 = pd.DataFrame({'Player': ['Jeremy', 'Alice', 'Bob', 'John', 'Mary'],
                    'Age': [31, 25, 27, 28, 21]})
df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field'],
                    'Throwing Speed': [71, 80, 81]})
print(df1)
print()
print(df2)

Output:

Player Age
0 Jeremy 31
1 Alice 25
2 Bob 27
3 John 28
4 Mary 21

… and …

Player Position Throwing Speed
0 Alice Pitcher 71
1 John Catcher 80
2 Mary Center Field 81

First, we import the Pandas library. Then we create the two data frames “df1” and “df2“. The first data frame contains the player’s names of a Baseball team, as well as the player’s age.

The second data frame also contains a part of the player’s names of the first data frame and these player’s position and their throwing speed.

We finally output the data frames and see the mentioned information in a compact way.

Now, we apply the merge() function:

pd.merge(df1, df2, on="Player")

Output:

Player Age Position Throwing Speed
0 Alice 25 Pitcher 71
1 John 28 Catcher 80
2 Mary 21 Center Field 81

The first two arguments are the names of the data frames that we want to merge. The third argument is the “on” parameter. The “on” parameter expects the column names to join on and we set it equal to “Player”.

Thus, Pandas merges these data frames on the “Player” column. The merged data frame only contains the players “Alice”, “John”, and “Mary” because these are the only players contained in both data frames. So, “Jeremy” and “Bob” from the first data frame are dropped.

The “left_on” and “right_on” Parameters

For this section, we will modify the data frame “df2” a little bit:

df2 = pd.DataFrame({'Name': ['Alice', 'John', 'Mary'],
                    'Position': ['Pitcher', 'Catcher', 'Center Field'],
                    'Throwing Speed': [71, 80, 81]})
print(df2)

Output:

Name Position Throwing Speed
0 Alice Pitcher 71
1 John Catcher 80
2 Mary Center Field 81

The only difference is that we changed the label of the “Player” column to “Name”.

Now, we want to merge the data frames “df1” and “df2” again. However, we cannot do so by applying the “on” parameter and assigning it to “Player” since “df2” does not have a “Player” column anymore.

Therefore, we use the two parameters “left_on” and “right_on“. We set the “left_on” parameter equal to the column label that we want to use for merging from the first data frame and we do the same with the “right_on” parameter for the second data frame:

pd.merge(df1, df2, left_on="Player", right_on="Name")

Output:

Player Age Name Position Throwing Speed
0 Alice 25 Alice Pitcher 71
1 John 28 John Catcher 80
2 Mary 21 Mary Center Field 81

This way, we can merge data frames by columns with different column labels.

Since the “Player” column and the “Name” column contain the same information, we might want to get rid of one of them:

df = pd.merge(df1, df2, left_on="Player", right_on="Name")
df = df.drop("Name", axis=1)
print(df)

Output:

Player Age Position Throwing Speed
0 Alice 25 Pitcher 71
1 John 28 Catcher 80
2 Mary 21 Center Field 81

We assign the merged data frame to a new variable called “df” and then we apply the drop() function and assign it the “Name” column. The “axis” parameter is set to “1” to state that we want to drop a column and not a row.

The outputted data frame now misses the “Name” column.

Merge Using Different Joins

In this next step, we will learn about the different types of merges and how to apply them using the “how” parameter.

Therefore, we change “df2” again. We rename the “Name” column back to “Player”. Also, we add two new players, “Jane” and “Mick”:

df2 = pd.DataFrame({'Player': ['Alice', 'John', 'Mary', 
                               'Jane', 'Mick'],
                    'Position': ['Pitcher', 'Catcher', 
                                 'Center Field', 'Pitcher', 'Catcher'],
                    'Throwing Speed': [71, 80, 81, 79, 75]})
Player Position Throwing Speed
0 Alice Pitcher 71
1 John Catcher 80
2 Mary Center Field 81
3 Jane Pitcher 79
4 Mick Catcher 75

df1” still looks like this:

print(df1)
Player Age
0 Jeremy 31
1 Alice 25
2 Bob 27
3 John 28
4 Mary 21

We start with the so-called “inner” join.

Here, we use the intersection of keys from both our data frames:

pd.merge(df1, df2, how="inner", on="Player")
Player Age Position Throwing Speed
0 Alice 25 Pitcher 71
1 John 28 Catcher 80
2 Mary 21 Center Field 81

As before, we assign the “on” parameter the value “Player” to specify what column we want to join on. We set the “how” parameter equal to "inner" to state that we want to perform an inner join.

The outputted data frame contains only the players that occur in both data frames. When we compare that merge to the merge we did in the first section, we can see that they are the same. That’s because "inner" is the default value for the “how” parameter.

The next type of merge we are looking at is the “outer” join.

The outer join is the union of keys from both our data frames:

pd.merge(df1, df2, how="outer", on="Player")
Player Age Position Throwing Speed
0 Jeremy 31.0 NaN NaN
1 Alice 25.0 Pitcher 71.0
2 Bob 27.0 NaN NaN
3 John 28.0 Catcher 80.0
4 Mary 21.0 Center Field 81.0
5 Jane NaN Pitcher 79.0
6 Mick NaN Catcher 75.0

The data frame contains all players from both data frames. Bob, for example, has no value for position and throwing speed because he is only contained in “df1” where we don’t get position and throwing speed values. Similarly, Jane does not have an age value here since she is only found in “df2” which does not provide age information.

The next merge type is the "left" join.

Here, we use the keys from the left data frame exclusively:

pd.merge(df1, df2, how="left", on="Player")
Player Age Position Throwing Speed
0 Jeremy 31 NaN NaN
1 Alice 25 Pitcher 71.0
2 Bob 27 NaN NaN
3 John 28 Catcher 80.0
4 Mary 21 Center Field 81.0

This data frame contains all the players from the left data frame which is “df1” in our case. Thus, Jeremy and Bob have no position and throwing speed values.

The "right" join is similar to the left join:

We are using the keys from the right data frame only:

pd.merge(df1, df2, how="right", on="Player")
Player Age Position Throwing Speed
0 Alice 25.0 Pitcher 71
1 John 28.0 Catcher 80
2 Mary 21.0 Center Field 81
3 Jane NaN Pitcher 79
4 Mick NaN Catcher 75

After merging, the data frame contains all the players from the right data frame which is “df2“. That’s why Jane and Mick have no age values here.

The last join, we are learning about, is a bit special. It is called “cross” join and it creates the cartesian product from both data frames while keeping the order of the keys from the left data frame:

pd.merge(df1, df2, how="cross")
Player_x Age Player_y Position Throwing Speed
0 Jeremy 31 Alice Pitcher 71
1 Jeremy 31 John Catcher 80
2 Jeremy 31 Mary Center Field 81
3 Jeremy 31 Jane Pitcher 79
4 Jeremy 31 Mick Catcher 75
5 Alice 25 Alice Pitcher 71
6 Alice 25 John Catcher 80
7 Alice 25 Mary Center Field 81
8 Alice 25 Jane Pitcher 79
9 Alice 25 Mick Catcher 75
10 Bob 27 Alice Pitcher 71
11 Bob 27 John Catcher 80
12 Bob 27 Mary Center Field 81
13 Bob 27 Jane Pitcher 79
14 Bob 27 Mick Catcher 75
15 John 28 Alice Pitcher 71
16 John 28 John Catcher 80
17 John 28 Mary Center Field 81
18 John 28 Jane Pitcher 79
19 John 28 Mick Catcher 75
20 Mary 21 Alice Pitcher 71
21 Mary 21 John Catcher 80
22 Mary 21 Mary Center Field 81
23 Mary 21 Jane Pitcher 79
24 Mary 21 Mick Catcher 75

We can observe that we have two “Player” columns now, “Player_x” and “Player_y”. Each player of “df2” is assigned to each player of “df1“. Since both data frames contain five rows, the resulting data frame now has 25 rows (5×5).

The “indicator” Parameter

When we merge two data frames, it might be useful to gain information about the source of the merge keys, whether they were observed only in the left data frame, only in the right data frame, or in both. Therefore, we use the “indicator” parameter:

pd.merge(df1, df2, how="outer", on="Player", indicator=True)
Player Age Position Throwing Speed _merge
0 Jeremy 31.0 NaN NaN left_only
1 Alice 25.0 Pitcher 71.0 both
2 Bob 27.0 NaN NaN left_only
3 John 28.0 Catcher 80.0 both
4 Mary 21.0 Center Field 81.0 both
5 Jane NaN Pitcher 79.0 right_only
6 Mick NaN Catcher 75.0 right_only

We perform an outer join on the “Player” column and add the “indicator” and we set it to “True“. This way, we get an additional column called “_merge” with the entries “left_only“, “right_only“, and “both“.

For example, Jeremy gets assigned the value “left_only” because he only appears in the left data frame “df1“. And Mary’s “_merge” value is set to “both” because she is found in both data frames.

Summary

In this tutorial, we learned about the Pandas function merge(). We learned how to perform different kinds of merges using the function’s various parameters.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!



Source link

Leave a reply

Please enter your comment!
Please enter your name here