Data Transformation#

Introduction#

The previous chapter showed how to access different parts of a dataframe and create new rows or columns. In this chapter, you’ll learn how to do more interesting transformations of data including aggregations and groupbys.

This chapter is hugely indebted to the fantastic Python Data Science Handbook by Jake Vanderplas. Remember, if you get stuck with pandas, there is brilliant documentation and a fantastic set of introductory tutorials on the pandas website. These notes are heavily indebted to those introductory tutorials.

Prerequisites#

This chapter uses the pandas and numpy packages. If you’re running this code, you may need to install these packages, which you can do using either conda install packagename or pip install packagename on your computer’s command line. If you’re running this page in Google Colab, you can install new packages by running !pip install packagename in a new code cell but you will need to restart the Colab notebook for the change to take effect. There’s a brief guide on installing packages in the Chapter on Preliminaries. Remember, you only need to install them once!

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator

Aggregation#

Aggregation means taking some data and applying a statistical operation that aggregates the information into a smaller number of statistics. Because aggregations take many numbers and produce, typically, fewer numbers, they involve a function of some kind. While you can always write your own aggregation functions, pandas has some common ones built in—and these can be applied to rows or columns as we’ll see.

The pandas built-in aggregation functions include:

Aggregation          Description
count()              Number of items
first(), last()      First and last item
mean(), median()     Mean and median
min(), max()         Minimum and maximum
std(), var()         Standard deviation and variance
mad()                Mean absolute deviation
prod()               Product of all items
sum()                Sum of all items
value_counts()       Counts of unique values

These can be applied to all entries in a dataframe, or optionally to rows or columns using axis=0 or axis=1 (or "rows" or "columns") respectively. Here are a couple of examples using the sum() function (note that we aggregate over rows, leaving only columns):

df = pd.DataFrame(np.random.randint(0, 5, (3, 5)), columns=list("ABCDE"))
df
A B C D E
0 4 2 2 3 2
1 0 3 1 1 4
2 3 3 3 4 0
df.sum(axis=0)
A    7
B    8
C    6
D    8
E    6
dtype: int64
df.sum(axis="rows")
A    7
B    8
C    6
D    8
E    6
dtype: int64

Exercise

Find the mean values of each column (i.e. aggregating over rows using axis=0, leaving only column headers) and the mean values of each row (i.e. aggregating over columns using axis=1, leaving only row headers).

If you want to create custom aggregations, you can of course do that too using the apply() method from Working with Data.
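
For instance, a minimal sketch of a custom column-wise aggregation on the df defined above (the lambda here is just for illustration):

# A custom aggregation via apply(): the range (max minus min) of each column
df.apply(lambda col: col.max() - col.min(), axis=0)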

Groupby and then aggregate (aka split, apply, combine)#

Splitting a dataset, applying a function, and combining the results are three key operations that we’ll want to use together again and again. Splitting means differentiating between rows or columns of data based on some condition, for instance different categories or different values. Applying means applying a function, for example finding the mean or sum. Combining means putting the results of these operations back into the dataframe, or into a variable. The figure below gives an example.

A very common way of doing this is using the groupby() operation to group rows according to some common property. The classic use case would be to find the mean by a particular characteristic, perhaps the mean height by gender. Note that the ‘combine’ part at the end doesn’t always have to result in a new dataframe; it could create new columns in an existing dataframe (where the new column is the result of a groupby operation, eg subtracting a group mean).

Let’s first see a really simple example of splitting a dataset into groups and finding the mean across those groups using the penguins dataset. We’ll group the data by island and look at the means.

penguins = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
penguins["sex"] = penguins["sex"].str.title()
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female
penguins.groupby("island").mean(numeric_only=True)
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
island
Biscoe 45.257485 15.874850 209.706587 4716.017964
Dream 44.167742 18.344355 193.072581 3712.903226
Torgersen 38.950980 18.429412 191.196078 3706.372549

The aggregations from the previous part all work on grouped data. An example is penguins.groupby("island")["body_mass_g"].median() for the median body mass by island.
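
As a quick sketch (output omitted), that looks like:

# Median body mass by island, using the built-in median() aggregation
penguins.groupby("island")["body_mass_g"].median()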

Exercise

Using a groupby, find the standard deviation of different penguin species’ body measurements. (Hint: standard deviation has its own aggregation/combine function given by std())

Multiple Functions using agg()#

You can also pass multiple functions via the agg() method (short for aggregation). Here we pass the names of two built-in aggregations, "mean" and "std" (we could equally pass functions such as np.mean and np.std; the strings are simply mapped to the built-in mean() and std()):

penguins.groupby("species")[penguins.select_dtypes("number").columns].agg(["mean", "std"])
bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
mean std mean std mean std mean std
species
Adelie 38.791391 2.663405 18.346358 1.216650 189.953642 6.539457 3700.662252 458.566126
Chinstrap 48.833824 3.339256 18.420588 1.135395 195.823529 7.131894 3733.088235 384.335081
Gentoo 47.504878 3.081857 14.982114 0.981220 217.186992 6.484976 5076.016260 504.116237

Exercise

Trying the obvious keywords, find the max and min of all columns grouping by island.

Multiple aggregations can also be performed at once on the entire dataframe by using a dictionary to map columns into functions. You can also group by as many variables as you like by passing the groupby method a list of variables. Here’s an example that combines both of these features:

penguins.groupby(["species", "island"]).agg({"body_mass_g": "sum", "bill_length_mm": "mean"})
body_mass_g bill_length_mm
species island
Adelie Biscoe 163225.0 38.975000
Dream 206550.0 38.501786
Torgersen 189025.0 38.950980
Chinstrap Dream 253850.0 48.833824
Gentoo Biscoe 624350.0 47.504878

Exercise

Using a two-column groupby() on species and island and an agg, use a dictionary to find the min of bill_depth_mm and the max of flipper_length_mm

Sometimes, inheriting the column names when using agg() is problematic. There’s a slightly fussy syntax to help with that. It uses tuples, i.e. values within parentheses, that follow the pattern (column name, function name). To the left of each tuple, you specify a new name for the resulting column. Here’s an example:

penguins.groupby(["species", "island"]).agg(
    count_bill=("bill_length_mm", "count"),
    mean_bill=("bill_length_mm", "mean"),
    std_flipper=("flipper_length_mm", "std"),
)
count_bill mean_bill std_flipper
species island
Adelie Biscoe 44 38.975000 6.729247
Dream 56 38.501786 6.585083
Torgersen 51 38.950980 6.232238
Chinstrap Dream 68 48.833824 7.131894
Gentoo Biscoe 123 47.504878 6.484976

Exercise

Using a two-column groupby() on species and island and a named agg(), create a new column called mean_flipper that gives the mean of flipper_length_mm.

Filter#

Filtering does exactly what it sounds like, but it can make use of group-by commands. In the example below, all but one species is filtered out.

In the example below, filter() passes a grouped version of the dataframe into the filter_func() we’ve defined (imagine that a dataframe is passed in for each group). Because what gets passed is a dataframe, the function argument x behaves like our dataframe within the body of filter_func(), including having the same columns.

def filter_func(x):
    return x["bill_length_mm"].mean() > 48


penguins.groupby("species").filter(filter_func).head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
152 Chinstrap Dream 46.5 17.9 192.0 3500.0 Female
153 Chinstrap Dream 50.0 19.5 196.0 3900.0 Male
154 Chinstrap Dream 51.3 19.2 193.0 3650.0 Male
155 Chinstrap Dream 45.4 18.7 188.0 3525.0 Female
156 Chinstrap Dream 52.7 19.8 197.0 3725.0 Male

Transform#

Transforms return a transformed version of the data that has the same shape as the input (typically with an intermediate groupby). Think of it as split-apply-combine but without the ‘combine’ part! This is useful when creating new columns that depend on some grouped data, for instance creating group-wise means. Below is a synthetic example that uses datetime-based groups to subtract a yearly mean. We’ll build the example from some random data, a datetime index, and some groups:

index = pd.date_range("1/1/2000", periods=10, freq="Q")
data = np.random.randint(0, 10, (10, 2))
df = pd.DataFrame(data, index=index, columns=["values1", "values2"])
df["type"] = np.random.choice(["group" + str(i) for i in range(3)], 10)
df
values1 values2 type
2000-03-31 2 0 group2
2000-06-30 3 3 group2
2000-09-30 7 3 group0
2000-12-31 3 8 group1
2001-03-31 3 0 group1
2001-06-30 2 7 group0
2001-09-30 4 5 group2
2001-12-31 1 2 group1
2002-03-31 8 1 group0
2002-06-30 9 1 group1

Now we take the yearly means by type. pd.Grouper(freq="A") is an instruction to group by year ("A" for annual) using the given datetime index. You can group on as many columns and/or index properties as you like: this example groups by a property of the datetime index and by the type column, but performs the computation on the values1 column.

df["v1_demean_yr_and_type"] = df.groupby([pd.Grouper(freq="A"), "type"])[
    "values1"
].transform(lambda x: x - x.mean())
df
values1 values2 type v1_demean_yr_and_type
2000-03-31 2 0 group2 -0.5
2000-06-30 3 3 group2 0.5
2000-09-30 7 3 group0 0.0
2000-12-31 3 8 group1 0.0
2001-03-31 3 0 group1 1.0
2001-06-30 2 7 group0 0.0
2001-09-30 4 5 group2 0.0
2001-12-31 1 2 group1 -1.0
2002-03-31 8 1 group0 0.0
2002-06-30 9 1 group1 0.0

Exercise

Create a new column that gives values2 normalised (minus the mean and divided by standard deviation) by type.

Assign#

Assign is a method that allows you to return a new object with all the original columns in addition to new ones. Existing columns that are re-assigned will be overwritten. This is really useful when you want to perform a bunch of operations together in a concise way while keeping the original columns. To show it in action, we’ll reuse the dataframe from the transform example above.

For instance, to demean the ‘values1’ column by year and type and to compute a new ‘val1_times_val2’ column using the newly demeaned ‘values1’ column:

df.assign(
    values1=(
        df.groupby([pd.Grouper(freq="A"), "type"])["values1"].transform(
            lambda x: x - x.mean()
        )
    ),
    val1_times_val2=lambda x: x["values1"] * x["values2"],
)
values1 values2 type v1_demean_yr_and_type val1_times_val2
2000-03-31 -0.5 0 group2 -0.5 -0.0
2000-06-30 0.5 3 group2 0.5 1.5
2000-09-30 0.0 3 group0 0.0 0.0
2000-12-31 0.0 8 group1 0.0 0.0
2001-03-31 1.0 0 group1 1.0 0.0
2001-06-30 0.0 7 group0 0.0 0.0
2001-09-30 0.0 5 group2 0.0 0.0
2001-12-31 -1.0 2 group1 -1.0 -2.0
2002-03-31 0.0 1 group0 0.0 0.0
2002-06-30 0.0 1 group1 0.0 0.0

Exercise

Re-write the transform function from the transform exercise (creating a normalised column transformed by type) as an assign statement with a new column name transformed_by_type.

agg(), transform(), and apply(): when to use each with a groupby#

With all of the different options available, it can be confusing to know when to use the different functions available for performing groupby operations, namely: .agg(), .transform(), and .apply(). Here are the key points to remember:

  • Use .agg() when using a groupby, but you want your groups to become the new index

  • Use .transform() when using a groupby, but you want to retain your original index

  • Use .apply() when using a groupby, but you want to perform operations that will leave neither the original index nor an index of groups

Let’s see an example of all three on a series (pd.Series()) to really highlight the differences. First, let’s create the series:

len_s = 1000
s = pd.Series(index=pd.date_range("2000-01-01", periods=len_s, name="date", freq="D"), data=prng.integers(-10, 10, size=len_s))
s.head()
date
2000-01-01    7
2000-01-02   -4
2000-01-03   -9
2000-01-04    7
2000-01-05    0
Freq: D, dtype: int64

Okay, now we can try these three operations out successively with some example functions. We’ll use skew for the first two, .agg() and .transform(), in order to highlight that the only difference between these is in the index that is returned. For .apply(), using lambda x: x.skew() would return the same as .agg() in this case, so we’ll opt for a more interesting example: keeping only values greater than zero and then taking their cumulative sum. Note that what is returned in this third case is an object with a multi-index: the first index tracks the groupby groups while the second tracks the original rows that survived the filtering to values greater than zero.

print("\n`.agg` following `.groupby`: groups provide index")
print(s.groupby(s.index.to_period("M")).agg("skew").head())
print("\n`.transform` following `.groupby`: retain original index")
print(s.groupby(s.index.to_period("M")).transform("skew").head())
print("\n`.apply` following `.groupby`: index entries can be new")
print(s.groupby(s.index.to_period("M")).apply(lambda x: x[x>0].cumsum()).head())
`.agg` following `.groupby`: groups provide index
date
2000-01    0.110284
2000-02    0.223399
2000-03    0.003857
2000-04    0.122812
2000-05   -0.039962
Freq: M, dtype: float64

`.transform` following `.groupby`: retain original index
date
2000-01-01    0.110284
2000-01-02    0.110284
2000-01-03    0.110284
2000-01-04    0.110284
2000-01-05    0.110284
Freq: D, dtype: float64

`.apply` following `.groupby`: index entries can be new
date     date      
2000-01  2000-01-01     7
         2000-01-04    14
         2000-01-06    23
         2000-01-07    29
         2000-01-12    38
dtype: int64

Reshaping Data#

The main options for reshaping data are pivot(), melt(), stack(), unstack(), pivot_table(), get_dummies(), crosstab(), and explode(). We’ll look at some of these here.

Pivoting data from tidy to, err, untidy#

In a previous chapter, we said you should use tidy data–one row per observation, one column per variable–whenever you can. But there are times when you will want to take your lovingly prepared tidy data and pivot it into a wider format. pivot() and pivot_table() help you to do that.

This can be especially useful for time series data, where operations like shift() or diff() are typically applied assuming that an entry in one row follows (in time) from the one above. Here’s an example:

data = {
    "value": np.random.randn(20),
    "variable": ["A"] * 10 + ["B"] * 10,
    "category": prng.choice(["type1", "type2", "type3", "type4"], 20),
    "date": (
        list(pd.date_range("1/1/2000", periods=10, freq="M"))
        + list(pd.date_range("1/1/2000", periods=10, freq="M"))
    ),
}
df = pd.DataFrame(data, columns=["date", "variable", "category", "value"])
df.sample(5)
date variable category value
8 2000-09-30 A type1 0.487944
9 2000-10-31 A type1 0.087052
14 2000-05-31 B type3 2.332000
17 2000-08-31 B type1 0.898014
11 2000-02-29 B type1 0.642219

If we just run shift() on this, it’s going to shift variable B’s and A’s together even though these overlap in time. So we pivot to a wider format (and then we can shift safely).

df.pivot(index="date", columns="variable", values="value").shift(1)
variable A B
date
2000-01-31 NaN NaN
2000-02-29 0.112438 0.299416
2000-03-31 1.847952 0.642219
2000-04-30 -0.856429 0.118734
2000-05-31 1.179597 -0.802559
2000-06-30 0.199138 2.332000
2000-07-31 0.976504 0.309620
2000-08-31 -1.462122 0.093544
2000-09-30 -0.022126 0.898014
2000-10-31 0.487944 0.632336

To go back to the original structure, albeit without the category columns, apply .unstack().reset_index().
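
Here’s a minimal sketch of that round trip, assuming the df defined just above (the names wide and long are only for illustration):

# Pivot to wide, then unstack back to long form (the category column is lost)
wide = df.pivot(index="date", columns="variable", values="value")
long = wide.unstack().reset_index(name="value")
long.head()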

Exercise

Perform a pivot that applies to both the variable and category columns. (Hint: remember that you will need to pass multiple objects via a list.)

Melt#

melt() can help you go from untidy to tidy data (from wide data to long data), and is a really good one to remember. Of course, if you’re a normal human, you’ll have to look back at the documentation most times you use this!

Here’s an example of it in action:

df = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "job": ["Nurse", "Economist"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)
print("\n Unmelted: ")
print(df)
print("\n Melted: ")
df.melt(id_vars=["first", "last"], var_name="quantity", value_vars=["height", "weight"])
 Unmelted: 
  first last        job  height  weight
0  John  Doe      Nurse     5.5     130
1  Mary   Bo  Economist     6.0     150

 Melted: 
first last quantity value
0 John Doe height 5.5
1 Mary Bo height 6.0
2 John Doe weight 130.0
3 Mary Bo weight 150.0

Exercise

Perform a melt that uses job as the id instead of first and last.

If you don’t wan the headscratching of melt, there’s also wide_to_long(), which is really useful for typical data cleaning cases where you have data like this:

df = pd.DataFrame(
    {
        "A1970": {0: "a", 1: "b", 2: "c"},
        "A1980": {0: "d", 1: "e", 2: "f"},
        "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
        "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
        "X": dict(zip(range(3), np.random.randn(3))),
        "id": dict(zip(range(3), range(3))),
    }
)
df
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -0.077470 0
1 b e 1.2 1.3 -1.095643 1
2 c f 0.7 0.1 0.881495 2

i.e. data where there are different variables and time periods across the columns. Wide to long is going to let us give info on what the stubnames are (‘A’, ‘B’), the name of the variable that’s always across columns (here, a year), any values (X here), and an id column.

pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year")
X A B
id year
0 1970 -0.077470 a 2.5
1 1970 -1.095643 b 1.2
2 1970 0.881495 c 0.7
0 1980 -0.077470 d 3.2
1 1980 -1.095643 e 1.3
2 1980 0.881495 f 0.1

Stack and unstack#

Stack, stack(), is a shortcut for taking a single type of wide data variable from columns and turning it into a long form dataset, but with an extra index.

Unstack, unstack(), unsurprisingly does the same operation, but in reverse.

Let’s define a multi-index dataframe to demonstrate this:

tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df
A B
first second
bar one -0.336068 0.658063
two -0.505635 -0.386031
baz one 0.407233 0.488460
two -0.514403 -0.429508
foo one -0.163958 0.392422
two -0.175950 -0.170050
qux one 0.187251 0.363253
two -1.458928 0.654461

Let’s stack this to create a tidy dataset:

df = df.stack()
df
first  second   
bar    one     A   -0.336068
               B    0.658063
       two     A   -0.505635
               B   -0.386031
baz    one     A    0.407233
               B    0.488460
       two     A   -0.514403
               B   -0.429508
foo    one     A   -0.163958
               B    0.392422
       two     A   -0.175950
               B   -0.170050
qux    one     A    0.187251
               B    0.363253
       two     A   -1.458928
               B    0.654461
dtype: float64

This has automatically created a multi-layered index, but that can be reverted to a numbered index using df.reset_index().
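
As a quick sketch (we don’t assign the result, since the stacked version is used again below; the column name "value" is our choice):

# Revert the stacked series to a numbered index; the unnamed level becomes "level_2"
df.reset_index(name="value").head()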

Now let’s see unstack but, instead of unstacking the ‘A’, ‘B’ variables we began with, let’s unstack the ‘first’ index level by passing level=0 (the default is to unstack the innermost index level). This diagram shows what’s going on:

And here’s the code:

df.unstack(level=0)
first bar baz foo qux
second
one A -0.336068 0.407233 -0.163958 0.187251
B 0.658063 0.488460 0.392422 0.363253
two A -0.505635 -0.514403 -0.175950 -1.458928
B -0.386031 -0.429508 -0.170050 0.654461

Exercise

What happens if you unstack to level=1 instead? What about applying unstack() twice?

Get dummies#

This is a really useful reshape command for when you want (explicit) dummies in your dataframe. When running simple regressions, you can often achieve the same effect by declaring that the column should be treated as a fixed effect, but there are some machine learning packages where converting to dummies explicitly may be easier.

Here’s an example:

df = pd.DataFrame(
    {"group_var": ["group1", "group2", "group3"], "B": ["c", "c", "b"], "C": [1, 2, 3]}
)
print(df)

pd.get_dummies(df, columns=["group_var"])
  group_var  B  C
0    group1  c  1
1    group2  c  2
2    group3  b  3
B C group_var_group1 group_var_group2 group_var_group3
0 c 1 True False False
1 c 2 False True False
2 b 3 False False True

A quick look at time series and rolling windows#

The support for time series and the datetime type is excellent in pandas and in Python in general; you can find more about this in Introduction to Time and more on how to use time series with pandas in Time Series.

It is very easy to manipulate datetimes with pandas. The relevant part of the documentation has more info; here we’ll just see a couple of the most important bits. First, let’s create some synthetic data to work with:

def recursive_ts(n, x=0.05, beta=0.6, alpha=0.2):
    shock = np.random.normal(loc=0, scale=0.6)
    if n == 0:
        return beta * x + alpha + shock
    else:
        return beta * recursive_ts(n - 1, x=x) + alpha + shock


t_series = np.cumsum([recursive_ts(n) for n in range(12)])
index = pd.date_range("1/1/2000", periods=12, freq="M")
df = pd.DataFrame(t_series, index=index, columns=["values"])
df.loc["2000-08-31", "values"] = np.nan
df
values
2000-01-31 0.721550
2000-02-29 0.369618
2000-03-31 0.845094
2000-04-30 1.522360
2000-05-31 1.201723
2000-06-30 1.705018
2000-07-31 1.466429
2000-08-31 NaN
2000-09-30 2.278682
2000-10-31 1.664481
2000-11-30 2.084541
2000-12-31 2.037873

Now let’s imagine that there are a number of issues with this time series. First, it’s been recorded wrong: it actually refers to the start of the next month, not the end of the previous as recorded; second, there’s a missing number we want to interpolate; third, we want to take the difference of it to get to something stationary; fourth, we’d like to add a lagged column. We can do all of those things!

# Change freq to next month start
df.index += pd.tseries.offsets.DateOffset(days=1)

# impute the value that is NaN (Not a Number) above
df["values"] = df["values"].interpolate(method="time")

# Take first differences
df["diff_values"] = df["values"].diff(1)

# Create a lag of the first differences
df["lag_diff_values"] = df["diff_values"].shift(1)
df
values diff_values lag_diff_values
2000-02-01 0.721550 NaN NaN
2000-03-01 0.369618 -0.351932 NaN
2000-04-01 0.845094 0.475476 -0.351932
2000-05-01 1.522360 0.677266 0.475476
2000-06-01 1.201723 -0.320637 0.677266
2000-07-01 1.705018 0.503296 -0.320637
2000-08-01 1.466429 -0.238589 0.503296
2000-09-01 1.879213 0.412784 -0.238589
2000-10-01 2.278682 0.399468 0.412784
2000-11-01 1.664481 -0.614200 0.399468
2000-12-01 2.084541 0.420060 -0.614200
2001-01-01 2.037873 -0.046669 0.420060

Having performed these operations, can you see why the "lag_diff_values" column has two entries that are NaN?

Exercise

Add a lead that is 3 months ahead of values.

Two other useful time series functions to be aware of are resample() and rolling(). resample() can upsample or downsample time series. Downsampling works by aggregation, eg df["values"].resample("Q").mean() downsamples to quarterly (‘Q’) frequency by taking the mean within each quarter. Upsampling involves a choice about how to fill in the missing values; examples of options are bfill() (backfill) and ffill() (forwards fill).
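
As a hedged sketch reusing the df from above (output not shown; the variable names are ours):

# Downsample the monthly series to quarterly frequency by taking the mean within each quarter
quarterly = df["values"].resample("Q").mean()

# Upsample back to monthly frequency, forward-filling the missing months
monthly = quarterly.resample("M").ffill()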

Rolling is for taking rolling aggregations, as you’d expect; for example, the 3-month rolling mean of our first difference time series:

df["diff_values"].rolling(3).mean()
2000-02-01         NaN
2000-03-01         NaN
2000-04-01         NaN
2000-05-01    0.266937
2000-06-01    0.277368
2000-07-01    0.286642
2000-08-01   -0.018644
2000-09-01    0.225830
2000-10-01    0.191221
2000-11-01    0.066017
2000-12-01    0.068443
2001-01-01   -0.080270
Name: diff_values, dtype: float64

Rolling Groupby

Often there are times when you’d like to compute the rolling mean at the group level, for example for each state. Here’s a typical example of this, and how to compute the grouped rolling mean. This example comes from the excellent calmcode website.

First, let’s pick up some data that we might want to apply this to and ensure the "date" column has the datetime datatype:

df = pd.read_csv("https://calmcode.io/datasets/birthdays.csv")
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")
df.head()
state year month day wday births
date
1969-01-01 AK 1969 1 1 Wed 14
1969-01-01 AL 1969 1 1 Wed 174
1969-01-01 AR 1969 1 1 Wed 78
1969-01-01 AZ 1969 1 1 Wed 84
1969-01-01 CA 1969 1 1 Wed 824

Note that we added a datetime index above; this is because .rolling() likes to have a datetime index to work on.

What we’ll do now is proceed in two steps:

  1. Group the data with .groupby(). Each grouped set will have an index attached and we’re getting a grouped-series object because we’re only selecting the births column.

  2. Use .transform() to perform an operation only within groups.

df["rolling_births"] = df.groupby('state')['births'].transform(lambda x: x.rolling("30D", min_periods=1).mean())
df.head()
state year month day wday births rolling_births
date
1969-01-01 AK 1969 1 1 Wed 14 14.0
1969-01-01 AL 1969 1 1 Wed 174 174.0
1969-01-01 AR 1969 1 1 Wed 78 78.0
1969-01-01 AZ 1969 1 1 Wed 84 84.0
1969-01-01 CA 1969 1 1 Wed 824 824.0

To check that the result is something sensible, you can always sort the dataframe by date and group. Here, that lets us check that the rolling births are indeed following the births column in the way that we expect.

df = df.sort_index().sort_values(by="state")
df.head()
state year month day wday births rolling_births
date
1969-01-01 AK 1969 1 1 Wed 14 14.000000
1985-02-20 AK 1985 2 20 Wed 45 34.100000
1981-03-15 AK 1981 3 15 Sun 20 26.633333
1976-10-20 AK 1976 10 20 Wed 16 22.000000
1972-11-12 AK 1972 11 12 Sun 22 18.466667

For a more in-depth look at combining groups with rolling aggregations, take a look at the tutorial on calmcode.

There is much more on time series with pandas in the Chapter on Time Series.

Method Chaining#

Warning

Method Chaining is a more advanced topic; feel free to skip it.

Sometimes, rather than splitting operations out into multiple lines, it can be more concise and clear to chain methods together. A typical time you might do this is when reading in a dataset and performing all of the initial cleaning. Tom Augspurger has a great tutorial on this, which I’ve reproduced parts of here. For more info on the pipe function used below, check out these short video tutorials.

To chain methods together, both the input and output of each step must be a pandas dataframe. Many built-in methods already work this way; for example, df.rename(columns={'old_col': 'new_col'}) takes in df and outputs a dataframe with one column name changed.

But occasionally, we’ll want to use a function that we’ve defined (rather than an already existing one). For that, we need the pipe method; it ‘pipes’ the result of one operation to the next operation. When objects are being passed through multiple functions, this can be much clearer. Compare, for example,

f(g(h(df), g_arg=a), f_arg=b)

that is, dataframe df is being passed to function h, and the results of that are being passed to a function g that needs a keyword argument g_arg, and the results of that are being passed to a function f that needs keyword argument f_arg. The nested structure is barely readable. Compare this with

(df.pipe(h)
   .pipe(g, g_arg=a)
   .pipe(f, f_arg=b)
)  
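
As a minimal runnable sketch of pipe() with user-defined functions (the toy dataframe and function names here are purely illustrative):

# A toy demonstration of piping a dataframe through user-defined functions
toy = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})


def add_total(data):
    # Return a copy with a column that sums x and y
    return data.assign(total=data["x"] + data["y"])


def scale(data, factor):
    # Return a copy with every column multiplied by factor
    return data * factor


toy.pipe(add_total).pipe(scale, factor=10)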

Let’s see a method chain in action on a real dataset so you get a feel for it. We’ll use 1,000 rows of flight data from BTS (a popular online dataset for demos of data cleaning!). (For further info on method chaining in Python, see these videos–but be aware they assume advanced knowledge of the language.)

df = pd.read_csv(
    "https://github.com/aeturrell/coding-for-economists/raw/main/data/flights1kBTS.csv",
    index_col=0,
)
df.head()
FL_DATE OP_UNIQUE_CARRIER TAIL_NUM ORIGIN ORIGIN_CITY_NAME DEST DEST_CITY_NAME DEP_TIME ARR_TIME ARR_DELAY CANCELLATION_CODE AIR_TIME DISTANCE Unnamed: 13
403296 2020-01-20 G4 231NV SFB Sanford, FL AVL Asheville, NC 2011.0 2137.0 6.0 NaN 64.0 465.0 NaN
126199 2020-01-06 MQ N687JS ORD Chicago, IL MSN Madison, WI 1837.0 1927.0 -16.0 NaN 26.0 109.0 NaN
417268 2020-01-20 YV N516LR IAD Washington, DC DTW Detroit, MI 1738.0 1915.0 14.0 NaN 60.0 383.0 NaN
67589 2020-01-04 OO N804SK SAV Savannah, GA DTW Detroit, MI 653.0 844.0 -26.0 NaN 91.0 705.0 NaN
102760 2020-01-05 WN N789SW SAN San Diego, CA TUS Tucson, AZ 732.0 941.0 -19.0 NaN 53.0 368.0 NaN

We’ll try and do a number of operations in one go: putting column titles in lower case, discarding useless columns, creating precise departure and arrival times, turning some of the variables into categoricals, creating a demeaned delay time, and creating a new categorical column for distances according to quantiles that will be called ‘near’, ‘less near’, ‘far’, and ‘furthest’. Some of these operations require a separate function, so we first define those. When we do the cleaning, we’ll pipe our dataframe to those functions (optionally passing any arguments).

def extract_city_name(df):
    """
    Chicago, IL -> Chicago for origin_city_name and dest_city_name
    """
    cols = ["origin_city_name", "dest_city_name"]
    city = df[cols].apply(lambda x: x.str.extract(r"(.*), \w{2}", expand=False))
    df = df.copy()
    df[["origin_city_name", "dest_city_name"]] = city
    return df


def time_to_datetime(df, columns):
    """
    Combine all time items into datetimes.

    2014-01-01,0914 -> 2014-01-01 09:14:00
    """
    df = df.copy()

    def converter(col):
        timepart = (
            col.astype(str)
            .str.replace(r"\.0$", "", regex=True)  # NaNs force float dtype
            .str.pad(4, fillchar="0")
        )
        return pd.to_datetime(
            df["fl_date"]
            + " "
            + timepart.str.slice(0, 2)
            + ":"
            + timepart.str.slice(2, 4),
            errors="coerce",
        )

    df[columns] = df[columns].apply(converter)
    return df


df = (
    df.drop([x for x in df.columns if "Unnamed" in x], axis=1)
    .rename(columns=str.lower)
    .pipe(extract_city_name)
    .pipe(time_to_datetime, ["dep_time", "arr_time"])
    .assign(
        fl_date=lambda x: pd.to_datetime(x["fl_date"]),
        dest=lambda x: pd.Categorical(x["dest"]),
        origin=lambda x: pd.Categorical(x["origin"]),
        tail_num=lambda x: pd.Categorical(x["tail_num"]),
        arr_delay=lambda x: pd.to_numeric(x["arr_delay"]),
        op_unique_carrier=lambda x: pd.Categorical(x["op_unique_carrier"]),
        arr_delay_demean=lambda x: x["arr_delay"] - x["arr_delay"].mean(),
        distance_group=lambda x: (
            pd.qcut(x["distance"], 4, labels=["near", "less near", "far", "furthest"])
        ),
    )
)
df.head()
fl_date op_unique_carrier tail_num origin origin_city_name dest dest_city_name dep_time arr_time arr_delay cancellation_code air_time distance arr_delay_demean distance_group
403296 2020-01-20 G4 231NV SFB Sanford AVL Asheville 2020-01-20 20:11:00 2020-01-20 21:37:00 6.0 NaN 64.0 465.0 8.544995 less near
126199 2020-01-06 MQ N687JS ORD Chicago MSN Madison 2020-01-06 18:37:00 2020-01-06 19:27:00 -16.0 NaN 26.0 109.0 -13.455005 near
417268 2020-01-20 YV N516LR IAD Washington DTW Detroit 2020-01-20 17:38:00 2020-01-20 19:15:00 14.0 NaN 60.0 383.0 16.544995 less near
67589 2020-01-04 OO N804SK SAV Savannah DTW Detroit 2020-01-04 06:53:00 2020-01-04 08:44:00 -26.0 NaN 91.0 705.0 -23.455005 far
102760 2020-01-05 WN N789SW SAN San Diego TUS Tucson 2020-01-05 07:32:00 2020-01-05 09:41:00 -19.0 NaN 53.0 368.0 -16.455005 less near