# Data Transformation

## Introduction

The previous chapter showed how to access different parts of a dataframe and create new rows or columns. In this chapter, you’ll learn how to do more interesting transformations of data, including aggregations and groupbys.

This chapter is hugely indebted to the fantastic Python Data Science Handbook by Jake Vanderplas. Remember, if you get stuck with pandas, there is brilliant documentation and a fantastic set of introductory tutorials on the pandas website. These notes are heavily indebted to those introductory tutorials.

### Prerequisites

This chapter uses the pandas, seaborn, and numpy packages. If you’re running this code, you may need to install these packages, which you can do using either conda install packagename or pip install packagename on your computer’s command line. If you’re running this page in Google Colab, you can install new packages by running !pip install packagename in a new code cell but you will need to restart the Colab notebook for the change to take effect. There’s a brief guide on installing packages in the Chapter on Preliminaries.

import seaborn as sns
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator


## Aggregation

Aggregation means taking some data and applying a statistical operation that aggregates the information into a smaller number of statistics. Because aggregations take many numbers and produce, typically, fewer numbers, they involve a function of some kind. While you can always write your own aggregation functions, pandas has some common ones built in—and these can be applied to rows or columns as we’ll see.

The pandas built-in aggregation functions include:

| Aggregation | Description |
| --- | --- |
| count() | Number of items |
| first(), last() | First and last item |
| mean(), median() | Mean and median |
| min(), max() | Minimum and maximum |
| std(), var() | Standard deviation and variance |
| mad() | Mean absolute deviation |
| prod() | Product of all items |
| sum() | Sum of all items |
| value_counts() | Counts of unique values |

These can be applied to all entries in a dataframe, or optionally to rows or columns using axis=0 or axis=1 (or "rows" or "columns") respectively. Here are a couple of examples using the sum function (note that we aggregate over rows, leaving only columns):

df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=list("ABCDE"))
df

   A  B  C  D  E
0  0  1  2  0  2
1  4  3  1  3  3
2  4  0  3  2  0

df.sum(axis=0)

A    8
B    4
C    6
D    5
E    5
dtype: int64

df.sum(axis="rows")

A    8
B    4
C    6
D    5
E    5
dtype: int64


Exercise

Find the mean values of each column (aggregating over rows using axis=0, leaving only column headers) and the mean values of each row (aggregating over columns using axis=1, leaving only row headers).

If you want to create custom aggregations, you can of course do that too using the apply method from Working with Data.
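As a sketch of that idea (with made-up data; the range aggregation here is our own illustration, not a pandas built-in):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(15).reshape(3, 5), columns=list("ABCDE"))

# A custom aggregation: the range (max minus min) of each column.
# axis=0 aggregates over rows, leaving one value per column.
col_range = df.apply(lambda col: col.max() - col.min(), axis=0)
print(col_range)
```

The same lambda with axis=1 would instead give the range of each row.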

## Groupby and then aggregate (aka split, apply, combine)

Splitting a dataset, applying a function, and combining the results are three key operations that we’ll want to use together again and again. Splitting means differentiating between rows or columns of data based on some conditions, for instance different categories or different values. Applying means applying a function, for example finding the mean or sum. Combine means putting the results of these operations back into the dataframe, or into a variable. The figure gives an example

A very common way of doing this is using the groupby operation to group rows according to some common property. The classic use-case would be to find the mean by a particular characteristic, perhaps the mean height by gender. Note that the ‘combine’ part at the end doesn’t always have to result in a new dataframe; it could create new columns in an existing dataframe (where the new column is the result of a groupby operation, e.g. subtracting a group mean).
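In miniature, the mean-height-by-gender example might look like this (the numbers and column names are invented for illustration):

```python
import pandas as pd

heights = pd.DataFrame(
    {
        "gender": ["F", "M", "F", "M"],
        "height_m": [1.62, 1.78, 1.70, 1.82],
    }
)

# Split by gender, apply the mean, combine into a series indexed by group
mean_height = heights.groupby("gender")["height_m"].mean()
print(mean_height)
```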

Let’s first see a really simple example of splitting a dataset into groups and finding the mean across those groups using the penguins dataset. We’ll group the data by island and look at the means.

penguins = sns.load_dataset("penguins")
penguins.head()

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
penguins.groupby("island").mean(numeric_only=True)

bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
island
Biscoe 45.257485 15.874850 209.706587 4716.017964
Dream 44.167742 18.344355 193.072581 3712.903226
Torgersen 38.950980 18.429412 191.196078 3706.372549

The aggregations from the previous part all work on grouped data. An example is penguins.groupby("island")["body_mass_g"].median() for the median body mass by island.

Exercise

Using a groupby, find the standard deviation of different penguin species’ body measurements. (Hint: standard deviation has its own aggregation/combine function given by std())

### Multiple Functions using agg

You can also pass multiple functions via the agg method (short for aggregation). Here we pass two numpy functions (these two functions have built-in equivalents std() and mean() so this is just for illustrative purposes):

penguins.groupby("species").agg([np.mean, np.std])

Pandas warns that the non-numeric columns, island and sex, did not aggregate successfully; they are dropped from the result because mean and std don’t apply to them.

bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
mean std mean std mean std mean std
species
Adelie 38.791391 2.663405 18.346358 1.216650 189.953642 6.539457 3700.662252 458.566126
Chinstrap 48.833824 3.339256 18.420588 1.135395 195.823529 7.131894 3733.088235 384.335081
Gentoo 47.504878 3.081857 14.982114 0.981220 217.186992 6.484976 5076.016260 504.116237

Exercise

Using numpy functions, find the max and min of all columns grouping by island.

Multiple aggregations can also be performed at once on the entire dataframe by using a dictionary to map columns into functions. You can also group by as many variables as you like by passing the groupby method a list of variables. Here’s an example that combines both of these features:

penguins.groupby(["species", "island"]).agg({"body_mass_g": "sum", "bill_length_mm": "mean"})

                      body_mass_g  bill_length_mm
species   island
Adelie    Dream          206550.0       38.501786
          Torgersen      189025.0       38.950980
Chinstrap Dream          253850.0       48.833824
Gentoo    Biscoe         624350.0       47.504878

Exercise

Using a two-column groupby on species and island and an agg, use a dictionary to find the min of bill_depth_mm and the max of flipper_length_mm

Sometimes, inheriting the column names when using agg is problematic. There’s a slightly fussy syntax to help with that. It uses tuples (values within parentheses, e.g. ()) that follow the pattern (column name, function name). To the left of the tuple, you can specify a new name for the resulting column. Here’s an example:

penguins.groupby(["species", "island"]).agg(
    count_bill=("bill_length_mm", "count"),
    mean_bill=("bill_length_mm", "mean"),
    std_flipper=("flipper_length_mm", "std"),
)

                      count_bill  mean_bill  std_flipper
species   island
Adelie    Dream               56  38.501786     6.585083
          Torgersen           51  38.950980     6.232238
Chinstrap Dream               68  48.833824     7.131894
Gentoo    Biscoe             123  47.504878     6.484976

Exercise

Using a two-column groupby on species and island and a named agg, create a new column called mean_flipper that gives the mean of flipper_length_mm.

### Filter

Filtering does exactly what it sounds like, but it can make use of groupby operations. In the example below, all but one species is filtered out.

In the example below, filter passes each group of the dataframe into the filter_func we’ve defined (imagine that a dataframe is passed for each group). Because each object passed is a dataframe, the x within the body of filter_func behaves like our dataframe, including having the same columns; only the groups for which filter_func returns True survive.

def filter_func(x):
    return x["bill_length_mm"].mean() > 48


penguins.groupby("species").filter(filter_func).head()

species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
152 Chinstrap Dream 46.5 17.9 192.0 3500.0 Female
153 Chinstrap Dream 50.0 19.5 196.0 3900.0 Male
154 Chinstrap Dream 51.3 19.2 193.0 3650.0 Male
155 Chinstrap Dream 45.4 18.7 188.0 3525.0 Female
156 Chinstrap Dream 52.7 19.8 197.0 3725.0 Male

### Transform

Transforms return a transformed version of the data, typically with an intermediate groupby, that has the same shape as the input. Think of it as split-apply-combine where the ‘combine’ step returns something with the original index! This is useful when creating new columns that depend on some grouped data, for instance creating group-wise means. Here’s an example that uses the datetime index to subtract a yearly mean. First let’s create a synthetic example with some data, a datetime index, and some groups:

index = pd.date_range("1/1/2000", periods=10, freq="Q")
data = np.random.randint(0, 10, (10, 2))
df = pd.DataFrame(data, index=index, columns=["values1", "values2"])
df["type"] = np.random.choice(["group" + str(i) for i in range(3)], 10)
df

values1 values2 type
2000-03-31 1 8 group0
2000-06-30 6 3 group0
2000-09-30 8 4 group0
2000-12-31 0 5 group0
2001-03-31 7 8 group1
2001-06-30 8 1 group2
2001-09-30 9 9 group2
2001-12-31 8 7 group1
2002-03-31 5 1 group2
2002-06-30 5 5 group2

Now we take the yearly means by type. pd.Grouper(freq='A') is an instruction to take the annual mean using the given datetime index. You can group on as many columns and/or index properties as you like: this example groups by a property of the datetime index and on the type column, but performs the computation on the values1 column.

df["v1_demean_yr_and_type"] = df.groupby([pd.Grouper(freq="A"), "type"])[
    "values1"
].transform(lambda x: x - x.mean())
df

values1 values2 type v1_demean_yr_and_type
2000-03-31 1 8 group0 -2.75
2000-06-30 6 3 group0 2.25
2000-09-30 8 4 group0 4.25
2000-12-31 0 5 group0 -3.75
2001-03-31 7 8 group1 -0.50
2001-06-30 8 1 group2 -0.50
2001-09-30 9 9 group2 0.50
2001-12-31 8 7 group1 0.50
2002-03-31 5 1 group2 0.00
2002-06-30 5 5 group2 0.00

Exercise

Create a new column that gives values2 normalised (minus the mean and divided by standard deviation) by type.

### Assign

Assign is a method that allows you to return a new object with all the original columns in addition to new ones. Existing columns that are re-assigned will be overwritten. This is really useful when you want to perform a bunch of operations together in a concise way while keeping the original columns. We’ll reuse the dataframe from the previous section to show it in action.

For instance, to demean the ‘values1’ column by year-type and to compute a new ‘val1_times_val2’ column using the newly demeaned ‘values1’ column:

df.assign(
    values1=(
        df.groupby([pd.Grouper(freq="A"), "type"])["values1"].transform(
            lambda x: x - x.mean()
        )
    ),
    val1_times_val2=lambda x: x["values1"] * x["values2"],
)

values1 values2 type v1_demean_yr_and_type val1_times_val2
2000-03-31 -2.75 8 group0 -2.75 -22.00
2000-06-30 2.25 3 group0 2.25 6.75
2000-09-30 4.25 4 group0 4.25 17.00
2000-12-31 -3.75 5 group0 -3.75 -18.75
2001-03-31 -0.50 8 group1 -0.50 -4.00
2001-06-30 -0.50 1 group2 -0.50 -0.50
2001-09-30 0.50 9 group2 0.50 4.50
2001-12-31 0.50 7 group1 0.50 3.50
2002-03-31 0.00 1 group2 0.00 0.00
2002-06-30 0.00 5 group2 0.00 0.00

Exercise

Re-write the transform function from the transform exercise (creating a normalised column transformed by type) as an assign statement with a new column name transformed_by_type.

## agg, transform, and apply: when to use each with a groupby

With all of the different options available, it can be confusing to know when to use the different functions available for performing groupby operations, namely: .agg, .transform, and .apply. Here are the key points to remember:

- Use .agg when using a groupby, but you want your groups to become the new index
- Use .transform when using a groupby, but you want to retain your original index
- Use .apply when using a groupby, but you want to perform operations that will leave neither the original index nor an index of groups
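As a minimal sketch of these three behaviours on a tiny synthetic series (our own data, not the series used below):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=pd.date_range("2000-01-01", periods=4, freq="D"))
g = s.groupby(pd.Grouper(freq="2D"))

agg_out = g.agg("sum")          # indexed by the two 2-day groups
trans_out = g.transform("sum")  # keeps the original four-day index
apply_out = g.apply(lambda x: x[x > 1].cumsum())  # index entries can be new
```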

Let’s see an example of all three on a series (pd.Series) to really highlight the differences. First, let’s create the series:

len_s = 1000
s = pd.Series(
    index=pd.date_range("2000-01-01", periods=len_s, name="date", freq="D"),
    data=prng.integers(-10, 10, size=len_s),
)
s.head()

date
2000-01-01    7
2000-01-02   -4
2000-01-03   -9
2000-01-04    7
2000-01-05    0
Freq: D, dtype: int64


Okay, now we can try these three operations out successively with some example functions. We’ll use skew for the first two, .agg and .transform, in order to highlight that the only difference between these is in the index that is returned. For .apply, using lambda x: x.skew() would return the same as .agg in this case, so we’ll opt for a more interesting example: only using values greater than zero and then taking their cumulative sum. Note that what is returned in this third case is an object with a multi-index: the first index tracks the groupby groups while the second tracks the original rows that survived the filtering to values greater than zero.

group = s.groupby(s.index.to_period("M"))

print("\n.agg following .groupby: groups provide index")
print(group.agg("skew").head())

print("\n.transform following .groupby: retain original index")
print(group.transform("skew").head())

print("\n.apply following .groupby: index entries can be new")
print(group.apply(lambda x: x[x > 0].cumsum()).head())

.agg following .groupby: groups provide index
date
2000-01    0.110284
2000-02    0.223399
2000-03    0.003857
2000-04    0.122812
2000-05   -0.039962
Freq: M, dtype: float64

.transform following .groupby: retain original index
date
2000-01-01    0.110284
2000-01-02    0.110284
2000-01-03    0.110284
2000-01-04    0.110284
2000-01-05    0.110284
Freq: D, dtype: float64

.apply following .groupby: index entries can be new
date     date
2000-01  2000-01-01     7
         2000-01-04    14
         2000-01-06    23
         2000-01-07    29
         2000-01-12    38
dtype: int64


## Reshaping Data

The main options for reshaping data are pivot, melt, stack, unstack, pivot_table, get_dummies, crosstab, and explode. We’ll look at some of these here.

### Pivoting data from tidy to, err, untidy

In a previous chapter, we said you should use tidy data–one row per observation, one column per variable–whenever you can. But there are times when you will want to take your lovingly prepared tidy data and pivot it into a wider format. pivot and pivot_table help you to do that.

This can be especially useful for time series data, where operations like shift or diff are typically applied assuming that an entry in one row follows (in time) from the one above. Here’s an example:

data = {
    "value": np.random.randn(20),
    "variable": ["A"] * 10 + ["B"] * 10,
    "category": prng.choice(["type1", "type2", "type3", "type4"], 20),
    "date": (
        list(pd.date_range("1/1/2000", periods=10, freq="M"))
        + list(pd.date_range("1/1/2000", periods=10, freq="M"))
    ),
}
df = pd.DataFrame(data, columns=["date", "variable", "category", "value"])
df.sample(5)

date variable category value
5 2000-06-30 A type2 -0.024676
15 2000-06-30 B type1 0.863850
13 2000-04-30 B type3 1.797387
12 2000-03-31 B type1 0.356689
16 2000-07-31 B type3 0.499390

If we just run shift on this, it will shift the values of A and B together even though they overlap in time. So we pivot to a wider format (and then we can shift safely).

df.pivot(index="date", columns="variable", values="value").shift(1)

variable A B
date
2000-01-31 NaN NaN
2000-02-29 -0.700624 -0.989124
2000-03-31 0.582003 0.455839
2000-04-30 0.919882 0.356689
2000-05-31 0.858423 1.797387
2000-06-30 -1.365462 0.558519
2000-07-31 -0.024676 0.863850
2000-08-31 0.593241 0.499390
2000-09-30 -1.425194 2.036025
2000-10-31 0.433021 -0.456338

To go back to the original structure, albeit without the category columns, apply .unstack().reset_index().
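Here’s a hedged sketch of that round trip on a small, made-up wide dataframe (the column and index names are our own):

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(
    np.arange(6).reshape(3, 2),
    index=pd.DatetimeIndex(
        pd.to_datetime(["2000-01-31", "2000-02-29", "2000-03-31"]), name="date"
    ),
    columns=pd.Index(["A", "B"], name="variable"),
)

# .unstack() on a single-index dataframe returns a long series with a
# (variable, date) multi-index; .reset_index() flattens it back to columns
long = wide.unstack().reset_index(name="value")
print(long)
```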

Exercise

Perform a pivot that applies to both the variable and category columns. (Hint: remember that you will need to pass multiple objects via a list.)

### Melt

melt can help you go from untidy to tidy data (from wide data to long data), and is a really good one to remember. Of course, I have to look at the documentation every single time myself, but I’m sure you’ll do better.

Here’s an example of it in action:

df = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "job": ["Nurse", "Economist"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)
print("\n Unmelted: ")
print(df)
print("\n Melted: ")
df.melt(id_vars=["first", "last"], var_name="quantity", value_vars=["height", "weight"])

 Unmelted:
first last        job  height  weight
0  John  Doe      Nurse     5.5     130
1  Mary   Bo  Economist     6.0     150

Melted:

first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0

Exercise

Perform a melt that uses job as the id instead of first and last.

If you don’t want the headscratching of melt, there’s also wide_to_long, which is really useful for typical data cleaning cases where you have data like this:

df = pd.DataFrame(
    {
        "A1970": {0: "a", 1: "b", 2: "c"},
        "A1980": {0: "d", 1: "e", 2: "f"},
        "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
        "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
        "X": dict(zip(range(3), np.random.randn(3))),
        "id": dict(zip(range(3), range(3))),
    }
)
df

A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 0.276195 0
1 b e 1.2 1.3 -0.259484 1
2 c f 0.7 0.1 1.434756 2

i.e. data where there are different variables and time periods across the columns. wide_to_long lets us say what the stubnames are (‘A’, ‘B’), the name of the variable that runs across the columns (here, a year), any value columns to carry along (X here), and an id column.

pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year")

X A B
id year
0 1970 0.276195 a 2.5
1 1970 -0.259484 b 1.2
2 1970 1.434756 c 0.7
0 1980 0.276195 d 3.2
1 1980 -0.259484 e 1.3
2 1980 1.434756 f 0.1

### Stack and unstack

Stack, stack(), is a shortcut for taking a single type of wide data variable from columns and turning it into a long form dataset, but with an extra index level.

Unstack, unstack(), unsurprisingly does the same operation in reverse.

Let’s define a multi-index dataframe to demonstrate this:

tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df

                        A         B
first second
bar   one       -0.687516  0.145385
      two       -0.610041  0.182725
baz   one       -0.796305  1.674887
      two       -0.245907  1.796827
foo   one       -1.791559 -1.211081
      two       -0.591410 -0.640346
qux   one       -0.875542  0.900365
      two        1.037305 -0.649355

Let’s stack this to create a tidy dataset:

df = df.stack()
df

first  second
bar    one     A   -0.687516
               B    0.145385
       two     A   -0.610041
               B    0.182725
baz    one     A   -0.796305
               B    1.674887
       two     A   -0.245907
               B    1.796827
foo    one     A   -1.791559
               B   -1.211081
       two     A   -0.591410
               B   -0.640346
qux    one     A   -0.875542
               B    0.900365
       two     A    1.037305
               B   -0.649355
dtype: float64


This has automatically created a multi-layered index but that can be reverted to a numbered index using df.reset_index()

Now let’s see unstack but, instead of unstacking the ‘A’, ‘B’ variables we began with, let’s unstack the ‘first’ column by passing level=0 (the default is to unstack the innermost index). This diagram shows what’s going on:

And here’s the code:

df.unstack(level=0)

first         bar       baz       foo       qux
second
one    A   -0.687516 -0.796305 -1.791559 -0.875542
       B    0.145385  1.674887 -1.211081  0.900365
two    A   -0.610041 -0.245907 -0.591410  1.037305
       B    0.182725  1.796827 -0.640346 -0.649355

Exercise

What happens if you unstack to level=1 instead? What about applying unstack() twice?

### Get dummies

This is a really useful reshape command for when you want (explicit) dummies in your dataframe. When running simple regressions, you can often achieve the same effect by declaring that the column should be treated as a fixed effect, but there are some machine learning packages where converting to dummies may be easier.

Here’s an example:

df = pd.DataFrame(
    {"group_var": ["group1", "group2", "group3"], "B": ["c", "c", "b"], "C": [1, 2, 3]}
)
print(df)

pd.get_dummies(df, columns=["group_var"])

  group_var  B  C
0    group1  c  1
1    group2  c  2
2    group3  b  3

B C group_var_group1 group_var_group2 group_var_group3
0 c 1 1 0 0
1 c 2 0 1 0
2 b 3 0 0 1

## A quick look at time series and rolling windows

The support for time series and the datetime type is excellent in pandas and in Python in general; you can find more about this in Introduction to Time and more on how to use time series with pandas in Time Series.

It is very easy to manipulate datetimes with pandas. The relevant part of the documentation has more info; here we’ll just see a couple of the most important bits. First, let’s create some synthetic data to work with:

def recursive_ts(n, x=0.05, beta=0.6, alpha=0.2):
    shock = np.random.normal(loc=0, scale=0.6)
    if n == 0:
        return beta * x + alpha + shock
    else:
        return beta * recursive_ts(n - 1, x=x) + alpha + shock


t_series = np.cumsum([recursive_ts(n) for n in range(12)])
index = pd.date_range("1/1/2000", periods=12, freq="M")
df = pd.DataFrame(t_series, index=index, columns=["values"])
df.loc["2000-08-31", "values"] = np.nan
df

values
2000-01-31 0.983823
2000-02-29 1.118883
2000-03-31 1.141399
2000-04-30 1.348510
2000-05-31 2.767885
2000-06-30 1.641988
2000-07-31 1.346391
2000-08-31 NaN
2000-09-30 0.932245
2000-10-31 1.571443
2000-11-30 2.524384
2000-12-31 1.716422

Now let’s imagine that there are a number of issues with this time series. First, it’s been recorded wrong: it actually refers to the start of the next month, not the end of the previous as recorded; second, there’s a missing number we want to interpolate; third, we want to take the difference of it to get to something stationary; fourth, we’d like to add a lagged column. We can do all of those things!

# Change freq to next month start
df.index += pd.tseries.offsets.DateOffset(days=1)

# impute the value that is NaN (Not a Number) above
df["values"] = df["values"].interpolate(method="time")

# Take first differences
df["diff_values"] = df["values"].diff(1)

# Create a lag of the first differences
df["lag_diff_values"] = df["diff_values"].shift(1)
df

values diff_values lag_diff_values
2000-02-01 0.983823 NaN NaN
2000-03-01 1.118883 0.135060 NaN
2000-04-01 1.141399 0.022516 0.135060
2000-05-01 1.348510 0.207110 0.022516
2000-06-01 2.767885 1.419376 0.207110
2000-07-01 1.641988 -1.125898 1.419376
2000-08-01 1.346391 -0.295597 -1.125898
2000-09-01 1.135923 -0.210468 -0.295597
2000-10-01 0.932245 -0.203678 -0.210468
2000-11-01 1.571443 0.639198 -0.203678
2000-12-01 2.524384 0.952941 0.639198
2001-01-01 1.716422 -0.807962 0.952941

Having performed these operations, can you see why the "lag_diff_values" column has two entries that are NaN?

Exercise

Add a lead that is 3 months ahead of values.

Two other useful time series functions to be aware of are resample and rolling. resample can upsample or downsample time series. Downsampling is by aggregation, e.g. df["values"].resample("Q").mean() to downsample to quarterly ("Q") frequency by taking the mean within each quarter. Upsampling involves a choice about how to fill in the missing values; examples of options are bfill (backfill) and ffill (forwards fill).
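As a small self-contained sketch of both directions (a synthetic monthly series; the "Q" and "M" frequency aliases are the ones used elsewhere in this chapter):

```python
import numpy as np
import pandas as pd

s = pd.Series(
    np.arange(6, dtype=float),
    index=pd.date_range("2000-01-31", periods=6, freq="M"),
)

# Downsample monthly -> quarterly by aggregating (here, the mean)
quarterly = s.resample("Q").mean()

# Upsample back to monthly, forward-filling the gaps between quarter ends
monthly = quarterly.resample("M").ffill()
```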

There is much more on time series with pandas in the Chapter on Time Series.

Rolling is for taking rolling aggregations, as you’d expect; for example, the 3-month rolling mean of our first difference time series:

df["diff_values"].rolling(3).mean()

2000-02-01         NaN
2000-03-01         NaN
2000-04-01         NaN
2000-05-01    0.121562
2000-06-01    0.549667
2000-07-01    0.166863
2000-08-01   -0.000706
2000-09-01   -0.543987
2000-10-01   -0.236581
2000-11-01    0.075018
2000-12-01    0.462820
2001-01-01    0.261392
Name: diff_values, dtype: float64


Rolling Groupby

Often there are times when you’d like to compute the rolling mean at the group level, for example for each state. Here’s a typical example of this, and how to compute the grouped rolling mean. This example comes from the excellent calmcode website.

First, let’s pick up some data that we might want to apply this to and ensure the "date" column has the datetime datatype:

df = pd.read_csv("https://calmcode.io/datasets/birthdays.csv")
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
df.head()

state year month day wday births
date
1969-01-01 AK 1969 1 1 Wed 14
1969-01-01 AL 1969 1 1 Wed 174
1969-01-01 AR 1969 1 1 Wed 78
1969-01-01 AZ 1969 1 1 Wed 84
1969-01-01 CA 1969 1 1 Wed 824

Note that we added a datetime index above; this is because .rolling likes to have a datetime index to work on.

What we’ll do now is proceed in two steps:

1. Group the data with .groupby(). Each grouped set will have an index attached and we’re getting a grouped-series object because we’re only selecting the births column.

2. Use .transform() to perform an operation only within groups.

df["rolling_births"] = df.groupby("state")["births"].transform(
    lambda x: x.rolling("30D", min_periods=1).mean()
)
df.head()

state year month day wday births rolling_births
date
1969-01-01 AK 1969 1 1 Wed 14 14.0
1969-01-01 AL 1969 1 1 Wed 174 174.0
1969-01-01 AR 1969 1 1 Wed 78 78.0
1969-01-01 AZ 1969 1 1 Wed 84 84.0
1969-01-01 CA 1969 1 1 Wed 824 824.0
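
If you’d like to try the same pattern without downloading any data, here’s a minimal synthetic version (state names and values invented for illustration):

```python
import pandas as pd

idx = pd.DatetimeIndex(
    pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-01", "2000-01-02"]),
    name="date",
)
df = pd.DataFrame(
    {"state": ["AK", "AK", "AL", "AL"], "births": [10.0, 20.0, 100.0, 200.0]},
    index=idx,
)

# The 30-day rolling mean is computed within each state group
df["rolling_births"] = df.groupby("state")["births"].transform(
    lambda x: x.rolling("30D", min_periods=1).mean()
)
```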

To check that the result is something sensible, you can always sort the dataframe by date and group. Here, that lets us check that the rolling births are indeed following the births column in the way that we expect.

df = df.sort_index().sort_values(by="state")
df.head()

state year month day wday births rolling_births
date
1969-01-01 AK 1969 1 1 Wed 14 14.000000
1985-02-20 AK 1985 2 20 Wed 45 34.100000
1981-03-15 AK 1981 3 15 Sun 20 26.633333
1976-10-20 AK 1976 10 20 Wed 16 22.000000
1972-11-12 AK 1972 11 12 Sun 22 18.466667

For a more in-depth look at combining groups with rolling aggregations, take a look at the tutorial on calmcode.

## Method Chaining

Warning

Method Chaining is a more advanced topic; feel free to skip it.

Sometimes, rather than splitting operations out into multiple lines, it can be more concise and clear to chain methods together. A typical time you might do this is when reading in a dataset and performing all of the initial cleaning. Tom Augspurger has a great tutorial on this, which I’ve reproduced parts of here. For more info on the pipe function used below, check out these short video tutorials.

To chain methods together, both the input and output of each step must be a pandas dataframe. Many methods already take and return dataframes; for example, df.rename(columns={'old_col': 'new_col'}) takes in df and outputs a dataframe with one column name changed.

But occasionally, we’ll want to use a function that we’ve defined (rather than an already existing one). For that, we need the pipe method; it ‘pipes’ the result of one operation to the next operation. When objects are being passed through multiple functions, this can be much clearer. Compare, for example,

f(g(h(df), g_arg=a), f_arg=b)


that is, dataframe df is being passed to function h, and the results of that are being passed to a function g that needs a key word argument g_arg, and the results of that are being passed to a function f that needs keyword argument f_arg. The nested structure is barely readable. Compare this with

(df.pipe(h)
.pipe(g, g_arg=a)
.pipe(f, f_arg=b)
)
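
Here’s a minimal runnable version of that pattern; h, g, and f below are stand-ins we’ve made up for illustration:

```python
import pandas as pd

def h(df):
    # drop rows with missing values
    return df.dropna()

def g(df, g_arg=1.0):
    # scale every value by g_arg
    return df * g_arg

def f(df, f_arg=0.0):
    # shift every value by f_arg
    return df + f_arg

df = pd.DataFrame({"a": [1.0, None, 3.0]})

# Reads top-to-bottom instead of inside-out
out = df.pipe(h).pipe(g, g_arg=10).pipe(f, f_arg=2)
```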


Let’s see a method chain in action on a real dataset so you get a feel for it. We’ll use 1,000 rows of flight data from BTS (a popular online dataset for demos of data cleaning!). (For further info on method chaining in Python, see these videos–but be aware they assume advanced knowledge of the language.)

df = pd.read_csv(
    "https://github.com/aeturrell/coding-for-economists/raw/main/data/flights1kBTS.csv",
    index_col=0,
)
df.head()

FL_DATE OP_UNIQUE_CARRIER TAIL_NUM ORIGIN ORIGIN_CITY_NAME DEST DEST_CITY_NAME DEP_TIME ARR_TIME ARR_DELAY CANCELLATION_CODE AIR_TIME DISTANCE Unnamed: 13
403296 2020-01-20 G4 231NV SFB Sanford, FL AVL Asheville, NC 2011.0 2137.0 6.0 NaN 64.0 465.0 NaN
126199 2020-01-06 MQ N687JS ORD Chicago, IL MSN Madison, WI 1837.0 1927.0 -16.0 NaN 26.0 109.0 NaN
417268 2020-01-20 YV N516LR IAD Washington, DC DTW Detroit, MI 1738.0 1915.0 14.0 NaN 60.0 383.0 NaN
67589 2020-01-04 OO N804SK SAV Savannah, GA DTW Detroit, MI 653.0 844.0 -26.0 NaN 91.0 705.0 NaN
102760 2020-01-05 WN N789SW SAN San Diego, CA TUS Tucson, AZ 732.0 941.0 -19.0 NaN 53.0 368.0 NaN

We’ll try and do a number of operations in one go: putting column titles in lower case, discarding useless columns, creating precise departure and arrival times, turning some of the variables into categoricals, creating a demeaned delay time, and creating a new categorical column for distances according to quantiles that will be called ‘near’, ‘less near’, ‘far’, and ‘furthest’. Some of these operations require a separate function, so we first define those. When we do the cleaning, we’ll pipe our dataframe to those functions (optionally passing any arguments).

def extract_city_name(df):
    """
    Chicago, IL -> Chicago for origin_city_name and dest_city_name
    """
    cols = ["origin_city_name", "dest_city_name"]
    city = df[cols].apply(lambda x: x.str.extract(r"(.*), \w{2}", expand=False))
    df = df.copy()
    df[["origin_city_name", "dest_city_name"]] = city
    return df

def time_to_datetime(df, columns):
    """
    Combine all time items into datetimes.

    2014-01-01,0914 -> 2014-01-01 09:14:00
    """
    df = df.copy()

    def converter(col):
        timepart = col.astype(str).str.replace(
            r"\.0$", "", regex=True
        )  # NaNs force float dtype
        return pd.to_datetime(
            df["fl_date"]
            + " "
            + timepart.str.slice(0, 2)
            + ":"
            + timepart.str.slice(2, 4),
            errors="coerce",
        )

    df[columns] = df[columns].apply(converter)
    return df

df = (
    df.drop([x for x in df.columns if "Unnamed" in x], axis=1)
    .rename(columns=str.lower)
    .pipe(extract_city_name)
    .pipe(time_to_datetime, ["dep_time", "arr_time"])
    .assign(
        fl_date=lambda x: pd.to_datetime(x["fl_date"]),
        dest=lambda x: pd.Categorical(x["dest"]),
        origin=lambda x: pd.Categorical(x["origin"]),
        tail_num=lambda x: pd.Categorical(x["tail_num"]),
        arr_delay=lambda x: pd.to_numeric(x["arr_delay"]),
        op_unique_carrier=lambda x: pd.Categorical(x["op_unique_carrier"]),
        arr_delay_demean=lambda x: x["arr_delay"] - x["arr_delay"].mean(),
        distance_group=lambda x: (
            pd.qcut(x["distance"], 4, labels=["near", "less near", "far", "furthest"])
        ),
    )
)
