Joins#

Introduction#

It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.

pandas has a really rich set of options for combining one or more dataframes, with the two most important being concatenate and merge. Some of the examples in this chapter show you how to join a pair of data frames. Fortunately this is enough, since you can combine three data frames by combining two pairs.

Prerequisites#

This chapter will use the pandas data analysis package.

Concatenate#

If you have two or more dataframes with the same index or the same columns, you can glue them together into a single dataframe using pd.concat().

For the same columns, pass axis=0 to glue the index together; for the same index, pass axis=1 to glue the columns together. The concatenate function will typically be used on a list of dataframes.

If you want to track where the original data came from in the final dataframe, use the keys keyword.

Here’s an example using data on two different states’ populations that also makes uses of the keys option:

import pandas as pd

base_url = "http://www.stata-press.com/data/r14/"
state_codes = ["ca", "il"]
end_url = "pop.dta"

# This grabs the two dataframes, one for each state
list_of_state_dfs = [pd.read_stata(base_url + state + end_url) for state in state_codes]
# Show example of first entry in list of dataframes
print(list_of_state_dfs[0])

# Concatenate the list of dataframes
df = pd.concat(list_of_state_dfs, keys=state_codes, axis=0)
df
        county      pop
0  Los Angeles  9878554
1       Orange  2997033
2      Ventura   798364
county pop
ca 0 Los Angeles 9878554
1 Orange 2997033
2 Ventura 798364
il 0 Cook 5285107
1 DeKalb 103729
2 Will 673586

Note that the keys argument is optional, but is useful for keeping track of origin dataframes within the merged dataframe.

Exercise

Concatenate the follow two dataframes:

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])

df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])

Merge#

There are so many options for merging dataframes using pd.merge(left, right, on=..., how=... that we won’t be able to cover them all here. The most important features are: the two dataframes to be merged, what variables (aka keys) to merge on (and these can be indexes) via on=, and how to do the merge (eg left, right, outer, inner) via how=. This diagram shows an example of a merge using keys from the left-hand dataframe:

The how= keyword works in the following ways:

  • how='left' uses keys from the left dataframe only to merge.

  • how='right' uses keys from the right dataframe only to merge.

  • how='inner' uses keys that appear in both dataframes to merge.

  • how='outer' uses the cartesian product of keys in both dataframes to merge on.

Let’s see examples of some of these:

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
# Right merge
pd.merge(left, right, on=["key1", "key2"], how="right")
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3

Note that the key combination of K2 and K0 did not exist in the left-hand dataframe, and so its entries in the final dataframe are NaNs. But it does have entries because we chose the keys from the right-hand dataframe.

What about an inner merge?

pd.merge(left, right, on=["key1", "key2"], how="inner")
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2

Now we see that the combination K2 and K0 are excluded because they didn’t exist in the overlap of keys in both dataframes.

Finally, let’s take a look at an outer merge that comes with some extra info via the indicator keyword:

pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K0 K1 A1 B1 NaN NaN left_only
2 K1 K0 A2 B2 C1 D1 both
3 K1 K0 A2 B2 C2 D2 both
4 K2 K1 A3 B3 NaN NaN left_only
5 K2 K0 NaN NaN C3 D3 right_only

Now we can see that the products of all key combinations are here. The indicator=True option has caused an extra column to be added, called ‘_merge’, that tells us which dataframe the keys on that row came from.

Exercise

Merge the following two dataframes using the left_on and right_on keyword arguments to specify a join on lkey and rkey respectively:

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

Exercise

Merge the following two dataframes on "a" using how="left" as a keyword argument:

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})

What do you notice about the position .loc[1, "c"] in the merged dataframe?

For more on the options for merging, see pandas’ comprehensive merging documentation.