Pandas merge_ordered() – A Simple Guide with Video – Finxter

0
71
Pandas merge_ordered() – A Simple Guide with Video – Finxter


In this tutorial, we will learn about the Pandas function merge_ordered(). This method performs a merge with optional interpolation. It is especially useful for ordered data like time series data.

Syntax and Parameters

Here are the parameters from the official documentation:

Parameter Type Description
left DataFrame
right DataFrame
on label or list Field names to join on. Must be contained in both
DataFrames.
left_on label or list, or array-like Field names to join on in the left DataFrame.
right_on label or list, or array-like Field names to join on in the right DataFrame.
left_by column name or list of
column names
Group the left DataFrame by group columns and
merge piece by piece with the right DataFrame.
right_by column name or list of
column names
Group the right DataFrame by group columns and
merge piece by piece with the left DataFrame.
fill_method {'ffill', None},
default: None
Interpolation method for data.
suffixes list-like, default is
("_x", "_y")
A length-2 sequence where each element is
optionally a string indicating the suffix to add to the overlapping column names in left and right respectively. A value of None instead of a string indicates that the column name from left or right should be left as it is. At least one of the values must not be None.
how {'left', 'right', 'outer', 'inner'},
default 'outer'
left: use keys from left data frame only
right: use keys from right data frame only
outer: use union of keys from both data frames
inner: use intersection of keys from both data frames
Returns Type Description
DataFrame The merged DataFrame output type will the be same
as ‘left’, if it is a subclass of DataFrame.

Basic Example

To get started, we will create two data frames:

import pandas as pd
df1 = pd.DataFrame({
    'Date': ['15/01/2019', '16/01/2019', '17/01/2019', '18/01/2019', 
    '19/01/2019', '20/01/2019'],
    'Price': [16.7, 18.4, 20.0, 19.3, 17.1, 21.2]
    })
print(df1)
Date Price
0 15.01.2019 16.7
1 16.01.2019 18.4
2 17.01.2019 20.0
3 18.01.2019 19.3
4 19.01.2019 17.1
5 20.01.2019 21.2
df2 = pd.DataFrame({
    'Date': ['15/01/2019', '17/01/2019', '18/01/2019', '20/01/2019', 
    '21/01/2019', '22/01/2019'],
    'Price': [14.6, 19.8, 21.9, 20.2, 17.4, 18.0]
})

print(df2)
Date Price
0 15.01.2019 14.6
1 17.01.2019 19.8
2 18.01.2019 21.9
3 20.01.2019 20.2
4 21.01.2019 17.4
5 22.01.2019 18.0

Here, we import the Pandas library as the first step. Then, we create the two data frames “df1” and “df2” which contain a “Date” column and a “Price” column respectively.

Now that we created these data frames, in the next step we can perform our first merge_ordered() operation:

pd.merge_ordered(df1, df2, on='Date')
Date Price_x Price_y
0 15.01.2019 16.7 14.6
1 16.01.2019 18.4 NaN
2 17.01.2019 20.0 19.8
3 18.01.2019 19.3 21.9
4 19.01.2019 17.1 NaN
5 20.01.2019 21.2 20.2
6 21.01.2019 NaN 17.4
7 22.01.2019 NaN 18.0

We apply the merge_ordered() function and put in the two data frames as the first two arguments of the function. That’s because these are the data frames that we want to merge. The third parameter is the “on” parameter. This parameter expects the column or a list of columns that we want to perform the merge on. We choose the “Date” column here.

The outputted data frame is longer than each of the two initial data frames. That’s because, by default, the merge_ordered() function performs a so-called “outer” join. That means, we use the union of keys from both our data frames. Since there are eight unique dates, the resulting data frame has eight rows in total.

We also get two price columns: “Price_x” and “Price_y“. For each date, we get a price from the left data frame (“Price_x“) and the right data frame (“Price_y“). If there is a “NaN” value, that means, for this specific date, we have only one price value. For example, for the "16.01.2019", we do not get a “Price_y” value because this date is only found in the first data frame.

The “fill_method” parameter

As we saw in the example above, there were some missing values labeled with “NaN“:

Date Price_x Price_y
0 15.01.2019 16.7 14.6
1 16.01.2019 18.4 NaN
2 17.01.2019 20.0 19.8
3 18.01.2019 19.3 21.9
4 19.01.2019 17.1 NaN
5 20.01.2019 21.2 20.2
6 21.01.2019 NaN 17.4
7 22.01.2019 NaN 18.0

We can get rid of these “NaN” values by replacing these values with the previous value. We achieve that by applying the “fill_method” parameter and assigning it to “ffill“:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill')
Date Price_x Price_y
0 15.01.2019 16.7 14.6
1 16.01.2019 18.4 14.6
2 17.01.2019 20.0 19.8
3 18.01.2019 19.3 21.9
4 19.01.2019 17.1 21.9
5 20.01.2019 21.2 20.2
6 21.01.2019 21.2 17.4
7 22.01.2019 21.2 18.0

Now, we do not have any more missing values here. For example, the “NaN” value in the “Price_y” column for the date "16.01.2019" was replaced with the previous value from that column (“14.6“).

If multiple values are missing directly one after the other, all missing values get replaced by the last available value. For example, the last two values from the “Price_x” column were missing. They were both replaced by the value of the third last row which was "21.2".

The “suffixes” parameter

In the previous example, the two price columns were named “Price_x” and “Price_y” by default. However, we can change these labels by applying the “suffixes” parameter:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', suffixes=['_leftDF', '_rightDF'])
Date Price_leftDF Price_rightDF
0 15.01.2019 16.7 14.6
1 16.01.2019 18.4 14.6
2 17.01.2019 20.0 19.8
3 18.01.2019 19.3 21.9
4 19.01.2019 17.1 21.9
5 20.01.2019 21.2 20.2
6 21.01.2019 21.2 17.4
7 22.01.2019 21.2 18.0

We performed the same merge_ordered() operation as before. But this time, we added the “suffixes” parameter and assigned it a list with the strings “_leftDF” and “_rightDF“. The two price columns in the resulting data frame are now called “Price_leftDF” and “Price_rightDF“.

As the name of the parameter suggests, we only change the suffixes here, not the whole label. That’s why the column labels still say “Price” before the suffixes because the initial column label said “Price” and we only added the suffixes after that label.

The different kinds of joins

As mentioned in the introduction, by default the merge_ordered() function performs an outer join. That means we take the union of keys from both data frames.

But we can change that by using the “how” parameter.

Another type of join is the “inner” join which uses the intersection of keys from both data frames:

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='inner')
Date Price_x Price_y
0 15.01.2019 16.7 14.6
1 17.01.2019 20.0 19.8
2 18.01.2019 19.3 21.9
3 20.01.2019 21.2 20.2

That means, we only get the dates that are found in both data frames.

The remaining two options the “how” parameter provides us with are the “left” join and the “right” join. The left join uses only the keys from the left data frame.

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='left')
Date Price_x Price_y
0 15.01.2019 16.7 14.6
1 16.01.2019 18.4 14.6
2 17.01.2019 20.0 19.8
3 18.01.2019 19.3 21.9
4 19.01.2019 17.1 21.9
5 20.01.2019 21.2 20.2

Whereas the right join only uses the keys from the right data frame.

pd.merge_ordered(df1, df2, on='Date', fill_method='ffill', how='right')
Date Price_x Price_y
0 15.01.2019 16.7 14.6
1 17.01.2019 20.0 19.8
2 18.01.2019 19.3 21.9
3 20.01.2019 21.2 20.2
4 21.01.2019 21.2 17.4
5 22.01.2019 21.2 18.0

Grouping by group columns

For this section, we will modify “df1” a little bit:

df1['Category'] = ['A', 'A', 'A', 'B', 'B', 'B']
print(df1)
Date Price Category
0 15.01.2019 16.7 A
1 16.01.2019 18.4 A
2 17.01.2019 20.0 A
3 18.01.2019 19.3 B
4 19.01.2019 17.1 B
5 20.01.2019 21.2 B

We added a column called “Category” and assigned the categories “A” or “B” to each row.

df2” remains unchanged:

Date Price
0 15.01.2019 14.6
1 17.01.2019 19.8
2 18.01.2019 21.9
3 20.01.2019 20.2
4 21.01.2019 17.4
5 22.01.2019 18.0

Now, we apply the “left_by” parameter and assign it the value column label “Category“:

pd.merge_ordered(df1, df2, on='Date', left_by="Category")
Date Price_x Category Price_y
0 15.01.2019 16.7 A 14.6
1 16.01.2019 18.4 A NaN
2 17.01.2019 20.0 A 19.8
3 18.01.2019 NaN A 21.9
4 20.01.2019 NaN A 20.2
5 21.01.2019 NaN A 17.4
6 22.01.2019 NaN A 18.0
7 15.01.2019 NaN B 14.6
8 17.01.2019 NaN B 19.8
9 18.01.2019 19.3 B 21.9
10 19.01.2019 17.1 B NaN
11 20.01.2019 21.2 B 20.2
12 21.01.2019 NaN B 17.4
13 22.01.2019 NaN B 18.0

This way, we group the left data frame by the “Category” column and merge that piece by piece with the right data frame.

When we look at the resulting data frame, we can observe that, for example, the “Price_x” entry for the date "18.01.2019" in row 3 is “NaN” although there is an entry for that date in “df1“. However, in “df1“, the date is assigned to the category “B“. So, in the merged data frame, the “Price_x” value for the date "18.01.2019" is found in row 9 with category “B“.

If we had a group column in the right data frame, we could do the same with the “right_by” parameter.

Summary

All in all, we learned how to use the Pandas function merge_ordered(). We saw how to apply the various parameters, how to use the different types of joins, and how to group by group columns.

For more tutorials about Pandas, Python libraries, Python in general, or other computer science-related topics, check out the Finxter Blog page.

Happy Coding!



Source link

Leave a reply

Please enter your comment!
Please enter your name here