22. Joins#

22.1. Introduction#

It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.

pandas has a really rich set of options for combining one or more data frames, with the two most important being concatenate and merge. Some of the examples in this chapter show you how to join a pair of data frames. Fortunately this is enough, since you can combine three data frames by combining two pairs.

22.1.1. Prerequisites#

This chapter will use the pandas data analysis package.

22.2. Concatenate#

If you have two or more data frames with the same index or the same columns, you can glue them together into a single data frame using pd.concat().

For the same columns, pass axis=0 to glue the index together; for the same index, pass axis=1 to glue the columns together. The concatenate function will typically be used on a list of data frames.

If you want to track where the original data came from in the final data frame, use the keys keyword.

Here’s an example using data on two different states’ populations that also makes uses of the keys option:

import pandas as pd

base_url = "http://www.stata-press.com/data/r14/"
state_codes = ["ca", "il"]
end_url = "pop.dta"

# This grabs the two data frames, one for each state
list_of_state_dfs = [pd.read_stata(base_url + state + end_url) for state in state_codes]
# Show example of first entry in list of data frames
print(list_of_state_dfs[0])

# Concatenate the list of data frames
df = pd.concat(list_of_state_dfs, keys=state_codes, axis=0)
df
        county      pop
0  Los Angeles  9878554
1       Orange  2997033
2      Ventura   798364
county pop
ca 0 Los Angeles 9878554
1 Orange 2997033
2 Ventura 798364
il 0 Cook 5285107
1 DeKalb 103729
2 Will 673586

Note that the keys argument is optional, but is useful for keeping track of origin data frames within the merged data frame.

Exercise

Concatenate the follow two data frames:

df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])

df2 = pd.DataFrame([['c', 3], ['d', 4]],
                   columns=['letter', 'number'])

22.3. Merge#

There are so many options for merging data frames using pd.merge(left, right, on=..., how=... that we won’t be able to cover them all here. The most important features are: the two data frames to be merged, what variables (aka keys) to merge on (and these can be indexes) via on=, and how to do the merge (eg left, right, outer, inner) via how=. This diagram shows an example of a merge using keys from the left-hand data frame:

The how= keyword works in the following ways:

  • how='left' uses keys from the left data frame only to merge.

  • how='right' uses keys from the right data frame only to merge.

  • how='inner' uses keys that appear in both data frames to merge.

  • how='outer' uses the cartesian product of keys in both data frames to merge on.

Let’s see examples of some of these:

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)
# Right merge
pd.merge(left, right, on=["key1", "key2"], how="right")
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2
3 K2 K0 NaN NaN C3 D3

Note that the key combination of K2 and K0 did not exist in the left-hand data frame, and so its entries in the final data frame are NaNs. But it does have entries because we chose the keys from the right-hand data frame.

What about an inner merge?

pd.merge(left, right, on=["key1", "key2"], how="inner")
key1 key2 A B C D
0 K0 K0 A0 B0 C0 D0
1 K1 K0 A2 B2 C1 D1
2 K1 K0 A2 B2 C2 D2

Now we see that the combination K2 and K0 are excluded because they didn’t exist in the overlap of keys in both data frames.

Finally, let’s take a look at an outer merge that comes with some extra info via the indicator keyword:

pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
key1 key2 A B C D _merge
0 K0 K0 A0 B0 C0 D0 both
1 K0 K1 A1 B1 NaN NaN left_only
2 K1 K0 A2 B2 C1 D1 both
3 K1 K0 A2 B2 C2 D2 both
4 K2 K1 A3 B3 NaN NaN left_only
5 K2 K0 NaN NaN C3 D3 right_only

Now we can see that the products of all key combinations are here. The indicator=True option has caused an extra column to be added, called ‘_merge’, that tells us which data frame the keys on that row came from.

Exercise

Merge the following two data frames using the left_on and right_on keyword arguments to specify a join on lkey and rkey respectively:

df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})

Exercise

Merge the following two data frames on "a" using how="left" as a keyword argument:

df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})

What do you notice about the position .loc[1, "c"] in the merged data frame?

For more on the options for merging, see pandas’ comprehensive merging documentation.