22. Joins#
22.1. Introduction#
It’s rare that a data analysis involves only a single data frame. Typically you have many data frames, and you must join them together to answer the questions that you’re interested in.
pandas has a really rich set of options for combining one or more data frames, with the two most important being concatenate and merge. Some of the examples in this chapter show you how to join a pair of data frames. Fortunately this is enough, since you can combine three data frames by combining two pairs.
22.1.1. Prerequisites#
This chapter will use the pandas data analysis package.
22.2. Concatenate#
If you have two or more data frames with the same index or the same columns, you can glue them together into a single data frame using pd.concat()
.
For the same columns, pass axis=0
to glue the index together; for the same index, pass axis=1
to glue the columns together. The concatenate function will typically be used on a list of data frames.
If you want to track where the original data came from in the final data frame, use the keys
keyword.
Here’s an example using data on two different states’ populations that also makes uses of the keys
option:
import pandas as pd
base_url = "http://www.stata-press.com/data/r14/"
state_codes = ["ca", "il"]
end_url = "pop.dta"
# This grabs the two data frames, one for each state
list_of_state_dfs = [pd.read_stata(base_url + state + end_url) for state in state_codes]
# Show example of first entry in list of data frames
print(list_of_state_dfs[0])
# Concatenate the list of data frames
df = pd.concat(list_of_state_dfs, keys=state_codes, axis=0)
df
county pop
0 Los Angeles 9878554
1 Orange 2997033
2 Ventura 798364
county | pop | ||
---|---|---|---|
ca | 0 | Los Angeles | 9878554 |
1 | Orange | 2997033 | |
2 | Ventura | 798364 | |
il | 0 | Cook | 5285107 |
1 | DeKalb | 103729 | |
2 | Will | 673586 |
Note that the keys
argument is optional, but is useful for keeping track of origin data frames within the merged data frame.
Exercise
Concatenate the follow two data frames:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
columns=['letter', 'number'])
df2 = pd.DataFrame([['c', 3], ['d', 4]],
columns=['letter', 'number'])
22.3. Merge#
There are so many options for merging data frames using pd.merge(left, right, on=..., how=...
that we won’t be able to cover them all here. The most important features are: the two data frames to be merged, what variables (aka keys) to merge on (and these can be indexes) via on=
, and how to do the merge (eg left, right, outer, inner) via how=
. This diagram shows an example of a merge using keys from the left-hand data frame:
The how=
keyword works in the following ways:
how='left'
uses keys from the left data frame only to merge.how='right'
uses keys from the right data frame only to merge.how='inner'
uses keys that appear in both data frames to merge.how='outer'
uses the cartesian product of keys in both data frames to merge on.
Let’s see examples of some of these:
left = pd.DataFrame(
{
"key1": ["K0", "K0", "K1", "K2"],
"key2": ["K0", "K1", "K0", "K1"],
"A": ["A0", "A1", "A2", "A3"],
"B": ["B0", "B1", "B2", "B3"],
}
)
right = pd.DataFrame(
{
"key1": ["K0", "K1", "K1", "K2"],
"key2": ["K0", "K0", "K0", "K0"],
"C": ["C0", "C1", "C2", "C3"],
"D": ["D0", "D1", "D2", "D3"],
}
)
# Right merge
pd.merge(left, right, on=["key1", "key2"], how="right")
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
3 | K2 | K0 | NaN | NaN | C3 | D3 |
Note that the key combination of K2 and K0 did not exist in the left-hand data frame, and so its entries in the final data frame are NaNs. But it does have entries because we chose the keys from the right-hand data frame.
What about an inner merge?
pd.merge(left, right, on=["key1", "key2"], how="inner")
key1 | key2 | A | B | C | D | |
---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 |
1 | K1 | K0 | A2 | B2 | C1 | D1 |
2 | K1 | K0 | A2 | B2 | C2 | D2 |
Now we see that the combination K2 and K0 are excluded because they didn’t exist in the overlap of keys in both data frames.
Finally, let’s take a look at an outer merge that comes with some extra info via the indicator
keyword:
pd.merge(left, right, on=["key1", "key2"], how="outer", indicator=True)
key1 | key2 | A | B | C | D | _merge | |
---|---|---|---|---|---|---|---|
0 | K0 | K0 | A0 | B0 | C0 | D0 | both |
1 | K0 | K1 | A1 | B1 | NaN | NaN | left_only |
2 | K1 | K0 | A2 | B2 | C1 | D1 | both |
3 | K1 | K0 | A2 | B2 | C2 | D2 | both |
4 | K2 | K1 | A3 | B3 | NaN | NaN | left_only |
5 | K2 | K0 | NaN | NaN | C3 | D3 | right_only |
Now we can see that the products of all key combinations are here. The indicator=True
option has caused an extra column to be added, called ‘_merge’, that tells us which data frame the keys on that row came from.
Exercise
Merge the following two data frames using the left_on
and right_on
keyword arguments to specify a join on lkey
and rkey
respectively:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
'value': [5, 6, 7, 8]})
Exercise
Merge the following two data frames on "a"
using how="left"
as a keyword argument:
df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
What do you notice about the position .loc[1, "c"]
in the merged data frame?
For more on the options for merging, see pandas’ comprehensive merging documentation.