Joins#
Introduction#
It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.
pandas has a really rich set of options for combining one or more dataframes, with the two most important being concatenate and merge. Some of the examples in this chapter show you how to join a pair of data frames. Fortunately this is enough, since you can combine three data frames by combining two pairs.
Prerequisites#
This chapter will use the pandas data analysis package.
Concatenate#
If you have two or more dataframes with the same index or the same columns, you can glue them together into a single dataframe using pd.concat()
.
For the same columns, pass axis=0
to glue the index together; for the same index, pass axis=1
to glue the columns together. The concatenate function will typically be used on a list of dataframes.
If you want to track where the original data came from in the final dataframe, use the keys
keyword.
Here’s an example using data on two different states’ populations that also makes uses of the keys
option:
import pandas as pd
base_url = (
"https://github.com/aeturrell/coding-for-economists/raw/refs/heads/main/data/"
)
state_codes = ["ca", "il"]
end_url = "pop.dta"
# This grabs the two dataframes, one for each state
list_of_state_dfs = [pd.read_stata(base_url + state + end_url) for state in state_codes]
# Show example of first entry in list of dataframes
print(list_of_state_dfs[0])
# Concatenate the list of dataframes
df = pd.concat(list_of_state_dfs, keys=state_codes, axis=0)
df
county pop
0 Los Angeles 9878554
1 Orange 2997033
2 Ventura 798364
county | pop | ||
---|---|---|---|
ca | 0 | Los Angeles | 9878554 |
1 | Orange | 2997033 | |
2 | Ventura | 798364 | |
il | 0 | Cook | 5285107 |
1 | DeKalb | 103729 | |
2 | Will | 673586 |
Note that the keys
argument is optional, but is useful for keeping track of origin dataframes within the merged dataframe.
Exercise
Concatenate the follow two dataframes:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]],
columns=['letter', 'number'])
Merge#
There are so many options for merging dataframes using pd.merge(left, right, on=..., how=...
that we won’t be able to cover them all here. The most important features are: the two dataframes to be merged, what variables (aka keys) to merge on (and these can be indexes) via on=
, and how to do the merge (eg left, right, outer, inner) via how=
. This diagram shows an example of a merge using keys from the left-hand dataframe:
The how=
keyword works in the following ways:
how='left'
uses keys from the left dataframe only to merge.how='right'
uses keys from the right dataframe only to merge.how='inner'
uses keys that appear in both dataframes to merge.how='outer'
uses the cartesian product of keys in both dataframes to merge on.
Let’s see examples of some of these:
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
# Right merge
pd.merge(left, right, on=["key1", "key2"], how="right")
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K0 | NaN | NaN | C3 | D3 |
Note that the key combination of K2 and K0 did not exist in the left-hand dataframe, and so its entries in the final dataframe are NaNs. But it does have entries because we chose the keys from the right-hand dataframe.
What about an inner merge?
pd.merge(left, right, on=["key1", "key2"], how="inner")
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
Now we see that the combination K2 and K0 are excluded because they didn’t exist in the overlap of keys in both dataframes.
Finally, let’s take a look at an outer merge that comes with some extra info via the indicator
keyword:
pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
key1 | key2 | A | B | C | D | _merge | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
4 | K2 | K0 | NaN | NaN | C3 | D3 | right_only |
5 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
Now we can see that the products of all key combinations are here. The indicator=True
option has caused an extra column to be added, called ‘_merge’, that tells us which dataframe the keys on that row came from.
Exercise
Merge the following two dataframes using the left_on
and right_on
keyword arguments to specify a join on lkey
and rkey
respectively:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8]})
Exercise
Merge the following two dataframes on "a"
using how="left"
as a keyword argument:
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
What do you notice about the position .loc[1, "c"]
in the merged dataframe?
For more on the options for merging, see pandas’ comprehensive merging documentation.