# Regression

## Introduction

In this chapter, you’ll learn how to run linear regressions with code.

If you’re running this code (either by copying and pasting it, or by downloading it using the icons at the top of the page), you may need to install the packages it uses first. There’s a brief guide to installing packages in the chapter on Preliminaries.

Most of this chapter will rely on pyfixest and we are indebted to Alexander Fischer for his efforts in bringing a convenient and flexible way to absorb high dimensional fixed effects to Python via pyfixest. Some of the material in this chapter follows Grant McDermott’s excellent notes and the Library of Statistical Translation.

## Quick Review of Econometrics

### Notation

Greek letters, like $$\beta$$, are the truth and represent parameters. Modified Greek letters are an estimate of the truth, for example $$\hat{\beta}$$. Sometimes Greek letters will stand in for vectors of parameters. Most of the time, upper case Latin characters such as $$X$$ will represent random variables (which could have more than one dimension). Lower case letters from the Latin alphabet denote realised data, for instance $$x$$ (which again could be multi-dimensional). Modified Latin alphabet letters denote computations performed on data, for instance $$\bar{x} = \frac{1}{n} \displaystyle\sum_{i} x_i$$ where $$n$$ is the number of samples.

Ordinary least squares (OLS) regression can be used to estimate the parameters of certain types of model, most typically models of the form

$y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 .$

This generic model says that the value of an outcome variable $$y$$ is a linear function of one or more input predictor variables $$x_i$$, where the $$x_i$$ could be transforms of original data. But the above equation is a platonic ideal, what we call a data generating process (DGP). OLS allows us to recover estimates of the parameters of the model, i.e. to find the $$\hat{\beta_i}$$, and to enable us to write an estimated model:

$y = \hat{\beta_0} + \hat{\beta_1} \cdot x_1 + \hat{\beta_2} \cdot x_2 + \epsilon .$

This equation can also be expressed in matrix form as

$y = x'\cdot \hat{\beta} + \epsilon$

where $$x' = (1, x_1, \dots, x_{n})'$$ and $$\hat{\beta} = (\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_{n}})$$.

Given data $$y_i$$ stacked to make a vector $$y$$ and $$x_{i}$$ stacked to make a matrix $$X$$, this can be solved for the coefficients $$\hat{\beta}$$ according to

$\hat{\beta} = \left(X'X\right)^{-1} X'y .$
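To make this concrete, here's a minimal numpy sketch (simulated data with made-up coefficients) that solves the normal equations directly and recovers the parameters of the DGP:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate data from a known DGP: y = 2 + 3*x1 - 1*x2 + noise
n = 10_000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_true = np.array([2.0, 3.0, -1.0])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Solve the normal equations: beta_hat = (X'X)^{-1} X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to [2, 3, -1]
```

In practice you would never invert $$X'X$$ by hand like this (dedicated routines are faster and more numerically stable), but it shows exactly what the formula above is doing.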

### Gauss-Markov Conditions

For OLS estimates of these parameters to be the best linear unbiased estimates (BLUE), a few conditions, known as the Gauss-Markov conditions, need to hold:

1. $$y$$ is a linear function of the $$\beta_i$$

2. $$y$$ and the $$x_i$$ are randomly sampled from the population.

3. There is no perfect multi-collinearity of variables.

4. $$\mathbb{E}(\epsilon | x_1, \dots, x_n) = 0$$ (unconfoundedness)

5. $$\text{Var}(\epsilon | x_1, \dots, x_n) = \sigma^2$$ (homoskedasticity)

(1)-(4) also guarantee that OLS estimates are unbiased and $$\mathbb{E}(\hat{\beta}_i) = \beta_i$$.

The classic linear model requires a sixth assumption: that $$\epsilon \thicksim \mathcal{N}(0, \sigma^2)$$.
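As a quick illustration of the unbiasedness claim, here's a simulation sketch (made-up coefficients): repeatedly drawing samples from a DGP that satisfies conditions (1)-(4) and averaging the OLS estimates should recover the true parameter.

```python
import numpy as np

rng = np.random.default_rng(0)
beta_true = 1.5
n, n_sims = 200, 2_000

estimates = np.empty(n_sims)
for s in range(n_sims):
    x = rng.normal(size=n)
    eps = rng.normal(size=n)  # E(eps | x) = 0, homoskedastic
    y = 0.5 + beta_true * x + eps
    X = np.column_stack([np.ones(n), x])
    estimates[s] = np.linalg.lstsq(X, y, rcond=None)[0][1]

print(estimates.mean())  # close to 1.5
```

Any single $$\hat{\beta}$$ misses the truth, but the average across repeated samples is on target: that is what unbiasedness means.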

The interpretation of regression coefficients depends on what their units are to begin with, but you can always work it out by differentiating both sides of the model equation with respect to the $$x_i$$. For example, for the first model equation above

$\frac{\partial y}{\partial x_i} = \beta_i$

so we get the interpretation that $$\beta_i$$ is the rate of change of $$y$$ with respect to $$x_i$$. If $$x_i$$ and $$y$$ are in levels, this means that a unit increase in $$x_i$$ is associated with a $$\beta_i$$ unit increase in $$y$$. If the right-hand side of the model includes $$\ln x_i$$ instead, then we get

$\frac{\partial y}{\partial x_i} = \beta_i \frac{1}{x_i}$

With some abuse of notation, we can rewrite this as $$\partial y = \beta_i \partial x_i/x_i$$, which says that a 1% change in $$x_i$$ is associated with a $$\beta_i/100$$ unit change in $$y$$. With logged $$y$$ and $$x_i$$ variables, it’s a percent change in $$x_i$$ that is associated with a percent change in $$y$$, or $$\partial y/y = \beta_i \partial x_i/x_i$$ (note that both sides of this equation are unitless in this case). Finally, another example that is important in practice is that of log differences, e.g. $$y = \beta_i (\ln x_i - \ln x_i')$$. Again abusing notation, this case may be represented as $$\partial y = \beta_i (\partial x_i/x_i - \partial x_i'/x_i')$$, i.e. a difference of two percentages (a percentage point change) in $$x_i$$ is associated with a $$\beta_i/100$$ unit change in $$y$$.
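To see the log interpretation numerically, here's a small sketch (simulated data, with an assumed coefficient of 0.5 on $$\ln x$$): a 1% increase in $$x$$ should be associated with roughly a $$0.5/100$$ unit change in $$y$$.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 100, size=50_000)
y = 2.0 + 0.5 * np.log(x) + rng.normal(scale=0.1, size=x.size)

# Regress y on ln(x)
X = np.column_stack([np.ones_like(x), np.log(x)])
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]

# A 1% increase in x moves y by roughly beta/100 units
print(beta_hat[1] / 100)  # roughly 0.005
```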

## Imports

Let’s import some of the packages we’ll be using:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lets_plot import *
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pyfixest as pf

LetsPlot.setup_html()

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator


## Regression basics

There are two ways to run regressions: passing the data directly as objects, and using formulae. For most situations, the formula API is more convenient and so most of what we’ll cover uses this.

For our workhorse regression package, we’ll be using pyfixest, which has impressive speed.

We’ll use the starwars dataset to run a regression of mass on height for star wars characters. This example borrows very heavily from notes by Grant McDermott. First, let’s bring the dataset in:

df = pd.read_csv(
    "https://github.com/aeturrell/coding-for-economists/raw/main/data/starwars.csv",
    index_col=0,
)
# Look at first few rows
df.head()

name height mass hair_color eye_color gender homeworld species
0 Luke Skywalker 172.0 77.0 blond blue male Tatooine Human
1 C-3PO 167.0 75.0 NaN yellow NaN Tatooine Droid
2 R2-D2 96.0 32.0 NaN red NaN Naboo Droid
3 Darth Vader 202.0 136.0 none yellow male Tatooine Human
4 Leia Organa 150.0 49.0 brown brown female Alderaan Human

Okay, now let’s do a regression using OLS and a formula that says our y-variable is mass and our regressor is height:

results_sw = pf.feols("mass ~ height", data=df)


Well, where are the results!? They’re stored in the object we created. To peek at them, we need to call the summary() method:

results_sw.summary()

###

Estimation:  OLS
Dep. var.: mass
Inference:  iid
Observations:  59

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |     2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|---------:|--------:|
| Intercept     |    -13.810 |      111.155 |    -0.124 |      0.902 | -236.393 | 208.773 |
| height        |      0.639 |        0.626 |     1.020 |      0.312 |   -0.615 |   1.892 |
---
RMSE: 166.502   R2: 0.018


What we’re seeing here are really several bits of information glued together. To just grab the coefficient results in a tidy format, use

results_sw.tidy()

Estimate Std. Error t value Pr(>|t|) 2.5% 97.5%
Coefficient
Intercept -13.810314 111.154526 -0.124244 0.901559 -236.393412 208.772785
height 0.638571 0.626058 1.019986 0.312045 -0.615089 1.892231

You’ll have noticed that we got an intercept, even though we didn’t specify one in the formula. pyfixest adds an intercept by default because, most of the time, you will want one. To turn it off, add -1 at the end of the formula, e.g. in this case you would call feols("mass ~ height - 1", data=df).

The fit we got in the case with the intercept was pretty terrible: a low $$R^2$$, and both of our confidence intervals are large and contain zero. What’s going on? If there’s one adage in regression that’s always worth paying attention to, it’s this: always plot your data. Let’s see what’s going on here (using matplotlib for plotting):

fig, ax = plt.subplots()
ax.scatter(df["height"], df["mass"], s=200, alpha=0.8)
ax.annotate(
    "Jabba the Hutt",
    df.iloc[df["mass"].idxmax()][["height", "mass"]],
    xytext=(0, -50),
    textcoords="offset points",
    arrowprops=dict(
        arrowstyle="fancy",
        color="k",
    ),
)
ax.set_ylim(0, None)
ax.set_title("Always plot the data", loc="left")
plt.show()


Oh dear, Jabba’s been on the paddy frogs again, and he’s a bit of a different case. When we’re estimating statistical relationships, we have all kinds of choices and should be wary about arbitrary decisions of what to include or exclude in case we fool ourselves about the generality of the relationship we are capturing. Let’s say we knew that we weren’t interested in Hutts, though, but only in other species: in that case, it’s fair enough to filter out Jabba and run the regression without this obvious outlier. We’ll exclude any entry that contains the string ‘Jabba’ in the name column:

results_outlier_free = pf.feols(
    "mass ~ height", data=df[~df["name"].str.contains("Jabba")]
)
results_outlier_free.summary()

###

Estimation:  OLS
Dep. var.: mass
Inference:  iid
Observations:  58

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |    -32.541 |       12.561 |    -2.591 |      0.012 | -57.703 |  -7.379 |
| height        |      0.621 |        0.071 |     8.785 |      0.000 |   0.480 |   0.763 |
---
RMSE: 18.804   R2: 0.58


This looks a lot more healthy. Not only is the model explaining a lot more of the data, but the coefficients are now significant.

### Standard errors

You’ll have seen that there’s a column for the standard error of the estimates in the regression table, and a line saying that the inference is ‘iid’. Let’s say that, instead, we want to use Eicker-Huber-White robust standard errors, aka “HC2” standard errors. We can specify these up front when estimating the model:

pf.feols("mass ~ height", data=df, vcov="HC2").summary()

###

Estimation:  OLS
Dep. var.: mass
Inference:  HC2
Observations:  59

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |    -13.810 |       23.864 |    -0.579 |      0.565 | -61.596 |  33.976 |
| height        |      0.639 |        0.089 |     7.139 |      0.000 |   0.459 |   0.818 |
---
RMSE: 166.502   R2: 0.018


Or, alternatively, we can go back to our existing results and get new standard errors “on the fly”:

results_sw.vcov("HC2").summary()

###

Estimation:  OLS
Dep. var.: mass
Inference:  HC2
Observations:  59

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |    -13.810 |       23.864 |    -0.579 |      0.565 | -61.596 |  33.976 |
| height        |      0.639 |        0.089 |     7.139 |      0.000 |   0.459 |   0.818 |
---
RMSE: 166.502   R2: 0.018


There are several different types of standard errors available in pyfixest:

• “iid”, “HC1”, “HC2”, “HC3”

• “hetero”, for heteroskedasticity-robust standard errors

• clustered standard errors: see below. CRV1 and CRV3 available.

You can find information on all of these in the pyfixest documentation.
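To demystify what “robust” means, here's a hand-rolled sketch of iid versus HC2 standard errors on simulated heteroskedastic data. This is for intuition only, not pyfixest’s implementation, and the coefficients and error structure are made up:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
eps = rng.normal(size=n) * (1 + np.abs(x))  # heteroskedastic errors
y = 1.0 + 2.0 * x + eps

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# iid standard errors assume a constant error variance
se_iid = np.sqrt(np.diag(XtX_inv) * (resid @ resid) / (n - 2))

# HC2: weight each squared residual by 1 / (1 - leverage)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverages (diagonal of the hat matrix)
meat = X.T @ (X * (resid**2 / (1 - h))[:, None])
se_hc2 = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

print(se_iid, se_hc2)  # the robust SE on the slope is noticeably larger here
```

Because the error variance grows with $$|x|$$ in this DGP, the iid formula understates the uncertainty on the slope; the sandwich (HC2) estimator corrects for this.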

For now, let’s look more closely at those last ones: clustered standard errors.

#### Clustered standard errors

Often, we know something about the structure of likely errors: namely, that they occur in groups. In the example below, we use one-way clustering to capture this effect in the errors.

Note that we first grab the subset of the data for which the variables we’re interested in are all defined; otherwise the example would fail because of missing cluster-group values.

xf = df.dropna(subset=["homeworld", "mass", "height", "species"])
pf.feols("mass ~ height", data=xf, vcov={"CRV1": "homeworld"}).summary()

###

Estimation:  OLS
Dep. var.: mass
Inference:  CRV1
Observations:  55

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |     -8.795 |       29.150 |    -0.302 |      0.765 | -67.859 |  50.268 |
| height        |      0.616 |        0.098 |     6.280 |      0.000 |   0.417 |   0.815 |
---
RMSE: 172.239   R2: 0.014


We can add two-way clustering of standard errors using the following:

pf.feols("mass ~ height", data=xf, vcov={"CRV1": "homeworld + species"}).summary()

###

Estimation:  OLS
Dep. var.: mass
Inference:  CRV1
Observations:  55

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |     -8.795 |       30.589 |    -0.288 |      0.776 | -71.357 |  53.766 |
| height        |      0.616 |        0.104 |     5.938 |      0.000 |   0.404 |   0.828 |
---
RMSE: 172.239   R2: 0.014


As you would generally expect, the addition of clustering has increased the standard errors.
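For intuition, one-way cluster-robust (CRV1) standard errors can also be hand-rolled: the “meat” of the sandwich sums score contributions within each cluster, so errors are allowed to be arbitrarily correlated inside a group. Again this is a sketch on simulated data (made-up coefficients), not pyfixest’s implementation:

```python
import numpy as np

rng = np.random.default_rng(3)
n_groups, per_group = 40, 25
n = n_groups * per_group
g = np.repeat(np.arange(n_groups), per_group)

x = rng.normal(size=n)
group_shock = rng.normal(size=n_groups)[g]  # errors correlated within group
y = 1.0 + 2.0 * x + group_shock + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat

# CRV1 "meat": sum over clusters of (X_g' e_g)(X_g' e_g)'
meat = np.zeros((2, 2))
for grp in range(n_groups):
    idx = g == grp
    score = X[idx].T @ resid[idx]
    meat += np.outer(score, score)

# Small-sample correction used by CRV1
k = X.shape[1]
correction = (n_groups / (n_groups - 1)) * ((n - 1) / (n - k))
se_crv1 = np.sqrt(np.diag(correction * XtX_inv @ meat @ XtX_inv))
print(se_crv1)
```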

## Fixed effects and categorical variables

Fixed effects are a way of allowing the intercept of a regression model to vary freely across individuals or groups. It is, for example, used to control for any individual-specific attributes that do not vary across time in panel data.

Let’s use the ‘mtcars’ dataset to demonstrate this. We’ll read it in and set the datatypes of some of the columns at the same time.

mpg = pd.read_csv(
    "https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv",
    dtype={"model": str, "mpg": float, "hp": float, "disp": float, "cyl": "category"},
)
mpg.head()

model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110.0 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110.0 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93.0 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110.0 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175.0 3.15 3.440 17.02 0 0 3 2

Now that we have our data in, we want to regress mpg (miles per gallon) on hp (horsepower), with fixed effects for cyl (cylinders). We could just use a formula like 'mpg ~ hp + cyl' because we took the trouble to declare that cyl was of datatype category when reading it in from the csv file. This means the formula parser will treat it as a category and use it as a fixed effect by default.

But when I read that formula, I get nervous that cyl might not have been processed correctly (i.e. it could have been read in as a float, which is what it looks like) and might be treated as a continuous variable in the regression, which is not what we want at all. So, to be safe, and to make our intentions explicit (even when the data is of type ‘category’), it’s best to use the syntax C(cyl) to ask for a fixed effect.

Here’s a regression which does that:

pf.feols("mpg ~ hp + C(cyl)", data=mpg).summary()

###

Estimation:  OLS
Dep. var.: mpg
Inference:  iid
Observations:  32

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |     28.650 |        1.588 |    18.044 |      0.000 |  25.398 |  31.903 |
| hp            |     -0.024 |        0.015 |    -1.560 |      0.130 |  -0.056 |   0.008 |
| C(cyl)[T.6]   |     -5.968 |        1.639 |    -3.640 |      0.001 |  -9.326 |  -2.610 |
| C(cyl)[T.8]   |     -8.521 |        2.326 |    -3.663 |      0.001 | -13.286 |  -3.756 |
---
RMSE: 2.943   R2: 0.754


We can see here that two of the three possible values of cyl:

mpg["cyl"].unique()

['6', '4', '8']
Categories (3, object): ['4', '6', '8']


have been added as fixed-effect regressors. Because an intercept is included, the coefficients given are relative to the excluded base category (here, cyl=4), which is absorbed into the intercept. We can turn the intercept off to get a coefficient for each unique cyl value:

pf.feols("mpg ~ hp + C(cyl) -1", data=mpg).tidy()

Estimate Std. Error t value Pr(>|t|) 2.5% 97.5%
Coefficient
hp -0.024039 0.015408 -1.560163 1.299540e-01 -0.055600 0.007523
C(cyl)[T.4] 28.650118 1.587787 18.044056 0.000000e+00 25.397684 31.902552
C(cyl)[T.6] 22.682463 2.228049 10.180415 6.476264e-11 18.118512 27.246414
C(cyl)[T.8] 20.129267 3.331419 6.042251 1.633553e-06 13.305166 26.953369

When there is an intercept, the coefficient on a fixed-effect category can be interpreted as the average difference in $$y$$ between that category and the excluded (base) category, holding all other variables fixed.
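The equivalence between C(...) fixed effects and manually added dummy columns can be checked with a small simulation (made-up groups and coefficients; a sketch of the mechanics, not pyfixest’s internals):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
n = 300
df_fe = pd.DataFrame({
    "grp": rng.choice(["a", "b", "c"], size=n),
    "x": rng.normal(size=n),
})
effects = {"a": 0.0, "b": 2.0, "c": -1.0}
df_fe["y"] = df_fe["grp"].map(effects) + 1.5 * df_fe["x"] + rng.normal(scale=0.1, size=n)

# C(grp) is equivalent to adding dummy columns, dropping one base category
dummies = pd.get_dummies(df_fe["grp"], drop_first=True, dtype=float)  # b and c vs base a
X = np.column_stack([np.ones(n), df_fe["x"], dummies])
beta_hat = np.linalg.lstsq(X, df_fe["y"], rcond=None)[0]
print(beta_hat)  # roughly [0, 1.5, 2, -1]: intercept, slope, b-vs-a, c-vs-a
```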

### High dimensional fixed effects, aka absorbing regression

Sometimes, you just have a LOT of fixed effects (and perhaps you don’t particularly care about them individually). A common example is having a large number of firms as part of a panel. Fortunately, there are ways to make regression with high dimensional fixed effects be both fast and concise. (In Stata, this is provided by the reghdfe package.)

Let’s say we have a regression of the form

$y_i = x_i\cdot \beta + z_i\cdot \gamma +\epsilon_i$

where $$y_i$$ are observations indexed by $$i$$, $$x_i$$ are vectors of exogenous variables whose coefficients ($$\beta$$) we care about, $$z_i$$ are vectors of fixed effects whose coefficients ($$\gamma$$) we don’t care much about, and the $$\epsilon_i$$ are errors. We can use an absorbing regression to solve for $$\beta$$ while ignoring $$\gamma$$.

Here’s an example using simulated data on workers taken from the docs for a different regression package (linearmodels). Let’s simulate some data first, with two fixed effects (state and firm) alongside the two exogenous variables we’re interested in.

from numpy.random import default_rng

rng = default_rng()  # Random number generator

# Create synthetic input data
nobs = 1_000_000  # No. observations
state_id = rng.integers(50, size=nobs)  # State identifier
firm_id = rng.integers(nobs // 5, size=nobs)  # Firm identifier (mean of 5 workers/firm)
x = rng.standard_normal((nobs, 2))  # Exogenous variables
sim = pd.DataFrame(
    {
        "state_id": pd.Categorical(state_id),
        "firm_id": pd.Categorical(firm_id),
        "exog_0": x[:, 0],
        "exog_1": x[:, 1],
    }
)

# Create synthetic relationship
beta = [1, 3]  # coefficients of interest
state_effects = rng.standard_normal(state_id.max() + 1)
state_effects = state_effects[state_id]  # Generate state fixed effects
firm_effects = rng.standard_normal(firm_id.max() + 1)
firm_effects = firm_effects[firm_id]  # Generate firm fixed effects
eps = rng.standard_normal(nobs)  # Generate errors
# Generate endogenous outcome variable
sim["y"] = (
    sim["exog_0"] * beta[0]
    + sim["exog_1"] * beta[1]
    + firm_effects
    + state_effects
    + eps
)
sim.head()

state_id firm_id exog_0 exog_1 y
0 1 186213 -0.761561 0.340281 -3.057208
1 17 159758 -0.680405 1.086975 0.445564
2 18 104489 0.683346 -0.106550 -1.247462
3 5 30840 -0.207784 0.169841 1.902407
4 25 144311 0.295865 1.241407 3.598971

Now we pass this to pyfixest, with the state_id and firm_id variables entered after a vertical bar:

results_hdfe = pf.feols("y ~ exog_0 + exog_1 | state_id + firm_id", data=sim)
results_hdfe.summary()

###

Estimation:  OLS
Dep. var.: y, Fixed effects: state_id+firm_id
Inference:  CRV1
Observations:  1000000

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| exog_0        |      0.998 |        0.001 |  1035.827 |      0.000 |  0.996 |   1.000 |
| exog_1        |      3.001 |        0.001 |  3007.267 |      0.000 |  2.999 |   3.003 |
---
RMSE: 0.894   R2: 0.937   R2 Within: 0.91


So, from our 1,000,000 observations, we have roughly 200,000 fixed effects that have been scooped up and packed away, leaving us with just the coefficients, $$\beta$$, on the exogenous variables of interest.
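The trick that makes absorbing regressions fast is the within transformation: demeaning $$y$$ and $$x$$ by group gives the same $$\beta$$ as estimating all the dummies explicitly (the Frisch-Waugh-Lovell theorem). Here's a small sketch with simulated data and one set of fixed effects (made-up coefficients):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(5)
n, n_firms = 5_000, 200
firm = rng.integers(n_firms, size=n)
x = rng.normal(size=n)
firm_fe = rng.standard_normal(n_firms)[firm]
y = 3.0 * x + firm_fe + rng.normal(size=n)

d = pd.DataFrame({"firm": firm, "x": x, "y": y})

# Within transformation: demean y and x by firm, then run plain OLS
d["x_w"] = d["x"] - d.groupby("firm")["x"].transform("mean")
d["y_w"] = d["y"] - d.groupby("firm")["y"].transform("mean")
beta_within = (d["x_w"] @ d["y_w"]) / (d["x_w"] @ d["x_w"])

# Same slope as the (much slower) dummy-variable regression
D = np.zeros((n, n_firms))
D[np.arange(n), firm] = 1.0
beta_full = np.linalg.lstsq(np.column_stack([x, D]), y, rcond=None)[0][0]

print(beta_within, beta_full)  # identical, and close to 3
```

Demeaning needs no giant dummy matrix, which is why this approach scales to hundreds of thousands of fixed effects.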

## Transformations of regressors

This chapter is showcasing linear regression. What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can’t be some kind of (potentially non-linear) transform of the original features $$x_i$$.

### Logs and arcsinh

You have two options for adding in logs: do them before, or do them in the formula. Doing them before just makes use of standard dataframe operations to declare a new column:

mpg["lnhp"] = np.log(mpg["hp"])
pf.feols("mpg ~ lnhp", data=mpg).tidy()

Estimate Std. Error t value Pr(>|t|) 2.5% 97.5%
Coefficient
Intercept 72.640475 6.004325 12.098024 4.551914e-13 60.378007 84.902943
lnhp -10.764176 1.224304 -8.792079 8.387409e-10 -13.264538 -8.263814

Alternatively, you can specify the log directly in the formula:

results_ln = pf.feols("mpg ~ np.log(hp)", data=mpg)
results_ln.tidy()

Estimate Std. Error t value Pr(>|t|) 2.5% 97.5%
Coefficient
Intercept 72.640475 6.004325 12.098024 4.551914e-13 60.378007 84.902943
np.log(hp) -10.764176 1.224304 -8.792079 8.387409e-10 -13.264538 -8.263814

Clearly, the first method will work for arcsinh(x) and log(x+1), but you can also pass these into the formula directly. (For more on the pros and cons of arcsinh, see Bellemare and Wichman [2020].) Here it is with arcsinh:

pf.feols("mpg ~ np.arcsinh(hp)", data=mpg).tidy()

Estimate Std. Error t value Pr(>|t|) 2.5% 97.5%
Coefficient
Intercept 80.104142 6.849916 11.694179 1.060707e-12 66.114748 94.093536
np.arcsinh(hp) -10.764584 1.224363 -8.791985 8.389380e-10 -13.265067 -8.264100
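Why might you prefer arcsinh? Unlike log, it is defined at zero, and for large values it behaves like $$\ln(2x)$$, so coefficients get a similar interpretation. A quick numerical check:

```python
import numpy as np

x = np.array([0.0, 0.1, 1.0, 10.0, 100.0])

# arcsinh(x) = ln(x + sqrt(x^2 + 1)): defined at zero, ~ ln(2x) for large x
print(np.arcsinh(x))
print(np.log(2 * x[1:]))  # close to arcsinh for large x, not for small x
```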

### Interaction terms and powers

Remember: linear regression means the model is linear in the regressors, not that the regressors can’t be non-linear transforms of the original features $$x_i$$. Two of the most common transformations that you might want to use are interaction terms and polynomial terms. An example of an interaction term would be

$y = \beta_0 + \beta_1 x_1 \cdot x_2$

while an example of a polynomial term would be

$y = \beta_0 + \beta_1 x_1^2$

i.e. the last term enters only after it is multiplied by itself.

One note of warning: the interpretation of the effect of a variable is no longer as simple as was set out at the start of this chapter. To work out the new interpretation, though, the procedure is the same: just take the derivative. In the case of the interaction model above, the effect of a unit change in $$x_1$$ on $$y$$ is now a function of $$x_2$$ (specifically, it is $$\beta_1 x_2$$). In the case of the polynomial model above, the effect of a unit change in $$x_1$$ on $$y$$ will be $$2\beta_1 \cdot x_1$$. For more on interaction terms, see Balli and Sørensen [2013].
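Here's a quick numerical check of that derivative logic with simulated data (the coefficients are made up): after fitting a model with an interaction, the marginal effect of x1 evaluated at different values of x2 matches $$\beta_1 + \beta_3 x_2$$.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 0.5 + 1.0 * x1 + 2.0 * x2 + 0.7 * x1 * x2 + rng.normal(scale=0.1, size=n)

# Fit y on a constant, x1, x2, and the interaction x1*x2
X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]


def marginal_effect_x1(x2_val):
    """dy/dx1 = b1 + b3 * x2: the effect of x1 depends on the level of x2."""
    return b[1] + b[3] * x2_val


print(marginal_effect_x1(0.0), marginal_effect_x1(1.0))  # roughly 1.0 and 1.7
```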

Alright, with all of that preamble out of the way, let’s see how we actually do some of this! Let’s try including linear and squared terms in the regression of mpg on hp, making use of the numpy power function:

res_poly = pf.feols("mpg ~ hp + np.power(hp, 2)", data=mpg)
res_poly.tidy()

Estimate Std. Error t value Pr(>|t|) 2.5% 97.5%
Coefficient
Intercept 40.409117 2.740759 14.743766 5.107026e-15 34.803635 46.014600
hp -0.213308 0.034884 -6.114812 1.162972e-06 -0.284654 -0.141963
np.power(hp,2) 0.000421 0.000098 4.274647 1.889240e-04 0.000219 0.000622

Now let’s include the original term in hp, a term in disp, and the interaction between them, which is represented by hp:disp in the table. We can do this by multiplying out the terms:

res_inter = pf.feols("mpg ~ hp * disp", data=mpg)
res_inter.tidy()

Estimate Std. Error t value Pr(>|t|) 2.5% 97.5%
Coefficient
Intercept 39.674263 2.914172 13.614248 7.172041e-14 33.704852 45.643674
hp -0.097894 0.024745 -3.956175 4.725449e-04 -0.148581 -0.047207
disp -0.073373 0.014387 -5.100113 2.109284e-05 -0.102843 -0.043904
hp:disp 0.000290 0.000087 3.336151 2.407037e-03 0.000112 0.000468

In the unusual case that you want only the interaction term, you write it as it appears in the table above:

res_only_inter = pf.feols("mpg ~ hp : disp", data=mpg)
res_only_inter.tidy()

Estimate Std. Error t value Pr(>|t|) 2.5% 97.5%
Coefficient
Intercept 25.809854 1.005184 25.676738 0.000000e+00 23.756994 27.862715
hp:disp -0.000142 0.000019 -7.408952 2.958491e-08 -0.000181 -0.000103

## The formula API explained

As you will have seen, ~ separates the left- and right-hand sides of the regression. + computes a set union, which will also be familiar from the examples above (i.e. it includes two terms as long as they are distinct). - computes a set difference; it adds the set of terms to the left of it while removing any that appear on the right of it. As we’ve seen, a*b is shorthand for a + b + a:b, with the last term representing the interaction. a/b is shorthand for a + a:b, which is useful if, for example, b is nested within a, so it doesn’t make sense to control for b on its own. The : operator can also interact multiple terms, so that (a + b):(c + d) is the same as a:c + a:d + b:c + b:d. C(a) tells the formula parser to treat a as a categorical variable that will be included as a fixed effect. Finally, as we saw above with powers, you can also pass vectorised functions, such as np.log() and np.power(), directly into formulae.

One gotcha with the formula API is ensuring that you have sensible variable names in your dataframe, i.e. ones that do not include whitespace or, to take a really pathological example, have the name ‘a + b’ for one of the columns that you want to regress on. You can dodge this kind of problem by passing in the variable name as, for example, Q("a + b") to be clear that the column name is anything within the Q("...").

## Multiple regression models

As is so often the case, you’re likely to want to run more than one model at once with different specifications. There are two ways you may wish to do this.

### Comparing a set of models you’ve already created

In the code above, we created three different models: results_ln, res_poly, and res_inter. Let’s say we now want to compare them in a table: we can, using pyfixest’s etable() function. (If you need to do this in statsmodels, use summary_col() or the stargazer package.)

Let’s see etable in action:

pf.etable([results_ln, res_poly, res_inter], type="df")

est1 est2 est3
0 depvar mpg mpg mpg
1 Intercept 72.640*** (6.004) 40.409*** (2.741) 39.674*** (2.914)
2 np.log(hp) -10.764*** (1.224)
3 hp -0.213*** (0.035) -0.098*** (0.025)
4 np.power(hp,2) 0.000*** (0.000)
5 disp -0.073*** (0.014)
6 hp:disp 0.000** (0.000)
7 R2 0.720 0.756 0.820
8 S.E. type iid iid iid
9 Observations 32 32 32

There are a few different type= options, including dataframe, markdown, and latex:

pf.etable([results_ln, res_poly, res_inter], type="md")

                              est1               est2               est3
--------------  ------------------  -----------------  -----------------
depvar                         mpg                mpg                mpg
------------------------------------------------------------------------
Intercept        72.640*** (6.004)  40.409*** (2.741)  39.674*** (2.914)
np.log(hp)      -10.764*** (1.224)
hp                                  -0.213*** (0.035)  -0.098*** (0.025)
np.power(hp,2)                       0.000*** (0.000)
disp                                                   -0.073*** (0.014)
hp:disp                                                  0.000** (0.000)
------------------------------------------------------------------------
------------------------------------------------------------------------
R2                           0.720              0.756              0.820
S.E. type                      iid                iid                iid
Observations                    32                 32                 32
------------------------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)


And of course the latex one (type="tex") can be written to a file using open('regression.tex', 'w').write(...) where the ellipsis is the relevant etable() command.

### Stepwise multiple models

Sometimes you want to use a formula and add variables step-wise. There are four stepwise functions: sw, sw0, csw, csw0.

Assume you have the following formula: y ~ x1 + sw(x2, x3). The stepwise function sw will estimate the following two models: y ~ x1 + x2 and y ~ x1 + x3. That is, each element in sw() is sequentially, and separately, added to the formula. If you use sw0 in lieu of sw, then the model y ~ x1 is also estimated: the 0 in the name means that the model without any stepwise element is estimated too.

The prefix c in the remaining two options, csw and csw0, means “cumulative”: each stepwise element is added to the next. That is, y ~ x1 + csw(x2, x3) would cause y ~ x1 + x2 and y ~ x1 + x2 + x3 to be estimated. The 0 has the same meaning, ie it estimates the most basic version. As an example, y ~ x1 + csw0(x2, x3) leads to the following three models being estimated: y ~ x1, y ~ x1 + x2 and y ~ x1 + x2 + x3.
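To make the four variants concrete, here's a small illustrative helper (not pyfixest’s actual parser) that expands stepwise syntax into the list of formulas that get estimated:

```python
def expand_stepwise(base, terms, cumulative=False, include_base=False):
    """Expand sw/csw-style stepwise terms into individual formulas.

    Illustrative sketch only -- pyfixest has its own formula parser.
    cumulative=True mimics csw; include_base=True mimics the trailing 0.
    """
    formulas = [base] if include_base else []
    for i, term in enumerate(terms):
        extra = " + ".join(terms[: i + 1]) if cumulative else term
        formulas.append(f"{base} + {extra}")
    return formulas


# sw(x2, x3)
print(expand_stepwise("y ~ x1", ["x2", "x3"]))
# ['y ~ x1 + x2', 'y ~ x1 + x3']

# csw0(x2, x3)
print(expand_stepwise("y ~ x1", ["x2", "x3"], cumulative=True, include_base=True))
# ['y ~ x1', 'y ~ x1 + x2', 'y ~ x1 + x2 + x3']
```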

Let’s see some examples (courtesy of the brilliant fixest documentation for the statistical language R). We’ll be using the iris dataset, which comes with the plotly package.

import plotly.express as px

iris = px.data.iris()
iris.columns = ["y", "x1", "x2", "x3", "species", "id"]
iris.head()

y x1 x2 x3 species id
0 5.1 3.5 1.4 0.2 setosa 1
1 4.9 3.0 1.4 0.2 setosa 1
2 4.7 3.2 1.3 0.2 setosa 1
3 4.6 3.1 1.5 0.2 setosa 1
4 5.0 3.6 1.4 0.2 setosa 1

Let’s try a regular (not cumulative) stepwise regression:

pf.feols("y ~ sw(x1, x2, x3)", data=iris).summary()

###

Estimation:  OLS
Dep. var.: y
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      6.481 |        0.481 |    13.466 |      0.000 |  5.530 |   7.432 |
| x1            |     -0.209 |        0.156 |    -1.339 |      0.183 | -0.517 |   0.099 |
---
RMSE: 0.82   R2: 0.012
###

Estimation:  OLS
Dep. var.: y
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      4.306 |        0.078 |    54.895 |      0.000 |  4.151 |   4.461 |
| x2            |      0.409 |        0.019 |    21.646 |      0.000 |  0.372 |   0.446 |
---
RMSE: 0.404   R2: 0.76
###

Estimation:  OLS
Dep. var.: y
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      4.779 |        0.073 |    65.616 |      0.000 |  4.636 |   4.923 |
| x3            |      0.888 |        0.051 |    17.297 |      0.000 |  0.786 |   0.989 |
---
RMSE: 0.475   R2: 0.669


And a cumulative stepwise regression:

pf.feols("y ~ csw(x1, x2, x3)", data=iris).summary()

###

Estimation:  OLS
Dep. var.: y
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      6.481 |        0.481 |    13.466 |      0.000 |  5.530 |   7.432 |
| x1            |     -0.209 |        0.156 |    -1.339 |      0.183 | -0.517 |   0.099 |
---
RMSE: 0.82   R2: 0.012
###

Estimation:  OLS
Dep. var.: y
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      2.251 |        0.247 |     9.104 |      0.000 |  1.763 |   2.740 |
| x1            |      0.597 |        0.069 |     8.602 |      0.000 |  0.460 |   0.734 |
| x2            |      0.471 |        0.017 |    27.616 |      0.000 |  0.437 |   0.504 |
---
RMSE: 0.33   R2: 0.84
###

Estimation:  OLS
Dep. var.: y
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      1.845 |        0.250 |     7.368 |      0.000 |  1.350 |   2.340 |
| x1            |      0.655 |        0.067 |     9.823 |      0.000 |  0.523 |   0.787 |
| x2            |      0.711 |        0.057 |    12.560 |      0.000 |  0.599 |   0.823 |
| x3            |     -0.563 |        0.127 |    -4.426 |      0.000 | -0.814 |  -0.311 |
---
RMSE: 0.31   R2: 0.859


This is a very verbose way to get regression coefficients out! We can get a summary of the info by using etable again, but this time as a method rather than a stand-alone function:

reg_iris_csw = pf.feols("y ~ csw(x1, x2, x3)", data=iris)
reg_iris_csw.etable()

| fml        | Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:-----------|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| y~x1       | Intercept     |      6.481 |        0.481 |    13.466 |      0.000 |  5.530 |   7.432 |
| y~x1       | x1            |     -0.209 |        0.156 |    -1.339 |      0.183 | -0.517 |   0.099 |
| y~x1+x2    | Intercept     |      2.251 |        0.247 |     9.104 |      0.000 |  1.763 |   2.740 |
| y~x1+x2    | x1            |      0.597 |        0.069 |     8.602 |      0.000 |  0.460 |   0.734 |
| y~x1+x2    | x2            |      0.471 |        0.017 |    27.616 |      0.000 |  0.437 |   0.504 |
| y~x1+x2+x3 | Intercept     |      1.845 |        0.250 |     7.368 |      0.000 |  1.350 |   2.340 |
| y~x1+x2+x3 | x1            |      0.655 |        0.067 |     9.823 |      0.000 |  0.523 |   0.787 |
| y~x1+x2+x3 | x2            |      0.711 |        0.057 |    12.560 |      0.000 |  0.599 |   0.823 |
| y~x1+x2+x3 | x3            |     -0.563 |        0.127 |    -4.426 |      0.000 | -0.814 |  -0.311 |

Even handier, pyfixest lets you visualise these results (via Lets-Plot):

reg_iris_csw.coefplot().show()


## Instrumental variables#

We will again be using pyfixest for instrumental variables (IV) regression. This sub-section is indebted to another package, linearmodels, for the following example.

Recall that a good instrumental variable $$z$$ has zero covariance with the error term of the regression (which is untestable) and non-zero covariance with the variable of interest (which is testable).
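Of the two conditions, relevance is straightforward to check in data. Here's a minimal sketch on simulated data (the variable names and numbers are illustrative, not from the chapter's examples):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2_000

u = rng.normal(size=n)                      # the (normally unobserved) error
z = rng.normal(size=n)                      # candidate instrument
x = 1.2 * z + 0.6 * u + rng.normal(size=n)  # endogenous variable of interest

# Relevance (testable): the instrument should co-move with x
cov_zx = np.cov(z, x)[0, 1]

# Exogeneity (untestable in practice; visible here only because we simulated u)
cov_zu = np.cov(z, u)[0, 1]

print(f"Cov(z, x) = {cov_zx:.2f}, Cov(z, u) = {cov_zu:.2f}")
```

With real data, only the first covariance can be computed; exogeneity has to be argued from theory or institutional knowledge.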

Recall that in IV regression, we have a model of the form

$\begin{split}y_i & = x_{1i}\hat{\beta}_1 + x_{2i}\hat{\beta}_2 + \epsilon_i \\ x_{2i} & = z_{1i}\hat{\delta} + z_{2i}\hat{\gamma} + \nu_i\end{split}$

where $$x_{1i}$$ is a set of $$k_1$$ exogenous regressors and $$x_{2i}$$ is a set of $$k_2$$ endogenous regressors such that $$\text{Cov}(x_{2i}, \epsilon_i)\neq 0$$. This is a problem for the usual OLS assumptions (the right-hand side should be exogenous).

To get around this, two-stage least squares (2SLS) IV first regresses $$x_{2i}$$ on the instruments, which explain $$x_{2i}$$ but not $$y_i$$ directly, and then regresses $$y_i$$ on the exogenous regressors and the fitted values from that first regression, i.e. on $$\hat{x}_{2i}$$.
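As a sanity check on the logic, here is a hand-rolled 2SLS on simulated data (a sketch with made-up numbers; in practice you should use a package like pyfixest, not least because naively re-using second-stage OLS standard errors is incorrect):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

u = rng.normal(size=n)                      # structural error
z = rng.normal(size=n)                      # instrument
x = 0.8 * z + 0.5 * u + rng.normal(size=n)  # endogenous: correlated with u
y = 2.0 + 1.5 * x + u                       # true slope is 1.5

Z = np.column_stack([np.ones(n), z])
X = np.column_stack([np.ones(n), x])

# First stage: regress x on the instrument and keep the fitted values
x_hat = Z @ np.linalg.lstsq(Z, x, rcond=None)[0]

# Second stage: regress y on the fitted values from the first stage
X_hat = np.column_stack([np.ones(n), x_hat])
beta_2sls = np.linalg.lstsq(X_hat, y, rcond=None)[0]

# Naive OLS for comparison: biased here because Cov(x, u) > 0
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"OLS slope: {beta_ols[1]:.2f}, 2SLS slope: {beta_2sls[1]:.2f}")
```

The naive OLS slope is pulled away from the true value of 1.5 by the correlation between $$x$$ and the error, while the 2SLS slope recovers it.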

It’s always easiest to see an example, so let’s estimate what might cause (realised) cigarette demand for the 48 continental US states in 1995.

dfiv = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv",
    dtype={"state": "category", "year": "category"},
).assign(
    rprice=lambda x: x["price"] / x["cpi"],
    rincome=lambda x: x["income"] / x["population"] / x["cpi"],
)

|    |   rownames | state | year |   cpi | population |      packs |    income |       tax |      price |      taxs |     rprice |   rincome |
|---:|-----------:|:------|:-----|------:|-----------:|-----------:|----------:|----------:|-----------:|----------:|-----------:|----------:|
|  0 |          1 | AL    | 1985 | 1.076 |    3973000 | 116.486282 |  46014968 | 32.500004 | 102.181671 | 33.348335 |  94.964381 | 10.763866 |
|  1 |          2 | AR    | 1985 | 1.076 |    2327000 | 128.534592 |  26210736 | 37.000000 | 101.474998 | 37.000000 |  94.307622 | 10.468165 |
|  2 |          3 | AZ    | 1985 | 1.076 |    3184000 | 104.522614 |  43956936 | 31.000000 | 108.578751 | 36.170418 | 100.909622 | 12.830456 |
|  3 |          4 | CA    | 1985 | 1.076 |   26444000 | 100.363037 | 447102816 | 26.000000 | 107.837341 | 32.104000 | 100.220580 | 15.713321 |
|  4 |          5 | CO    | 1985 | 1.076 |    3209000 | 112.963539 |  49466672 | 31.000000 |  94.266663 | 31.000000 |  87.608425 | 14.326190 |

Now we’ll specify the model. In pyfixest, it’s going to be in the form dep ~ exog | fixed effects | endog ~ instruments, where endog will be regressed on instruments and dep will be regressed on both exog and the predicted values of endog.

In this case, the model will be

$\log(\text{Price}_i) = \hat{\pi}_0 + \hat{\pi}_1 \text{SalesTax}_i + v_i$

in the first stage regression and

$\log(\text{Packs}_i) = \hat{\beta}_0 + \hat{\beta}_2\widehat{\log(\text{Price}_i)} + \hat{\beta}_1 \log(\text{RealIncome}_i) + u_i$

in the second stage.

results_iv = pf.feols("np.log(packs) ~ np.log(rincome) + 1 | year + state | np.log(rprice) ~ taxs ", data=dfiv, vcov={"CRV1": "year"})
results_iv.summary()

###

Estimation:  IV
Dep. var.: np.log(packs), Fixed effects: year+state
Inference:  CRV1
Observations:  96

| Coefficient     |   Estimate |   Std. Error |               t value |   Pr(>|t|) |   2.5% |   97.5% |
|:----------------|-----------:|-------------:|----------------------:|-----------:|-------:|--------:|
| np.log(rprice)  |     -1.279 |        0.000 | -6058735433955890.000 |      0.000 | -1.279 |  -1.279 |
| np.log(rincome) |      0.443 |        0.000 |  3295534227163963.000 |      0.000 |  0.443 |   0.443 |
---


We can compare the IV results against (naive) OLS. First, run the OLS equivalent:

results_cigs_ols = pf.feols("np.log(packs) ~ np.log(rprice) + np.log(rincome) | year + state", data=dfiv, vcov={"CRV1": "year"})
results_cigs_ols.summary()

###

Estimation:  OLS
Dep. var.: np.log(packs), Fixed effects: year+state
Inference:  CRV1
Observations:  96

| Coefficient     |   Estimate |   Std. Error |               t value |   Pr(>|t|) |   2.5% |   97.5% |
|:----------------|-----------:|-------------:|----------------------:|-----------:|-------:|--------:|
| np.log(rprice)  |     -1.056 |        0.000 | -6960957983563793.000 |      0.000 | -1.056 |  -1.056 |
| np.log(rincome) |      0.497 |        0.000 |  5834939776353078.000 |      0.000 |  0.497 |   0.497 |
---
RMSE: 0.044   R2: 0.967   R2 Within: 0.556


Let’s do a proper comparison of these models using etable(). Note the order: the IV results are the second column of data.

pf.etable([results_cigs_ols, results_iv], type="md")

                              est1               est2
---------------  -----------------  -----------------
depvar               np.log(packs)      np.log(packs)
-----------------------------------------------------
np.log(rprice)   -1.056*** (0.000)  -1.279*** (0.000)
np.log(rincome)   0.497*** (0.000)   0.443*** (0.000)
-----------------------------------------------------
year                             x                  x
state                            x                  x
-----------------------------------------------------
R2                           0.967                  -
S.E. type                 by: year           by: year
Observations                    96                 96
-----------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)


Once we take into account the fact that the real price is endogenous to (realised) demand, we find that its coefficient is more negative; i.e. an increase in the real price of cigarettes creates a bigger fall in the number of packs bought.
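The direction of this change is what the standard omitted-correlation result predicts. For a single endogenous regressor $$x$$ with true coefficient $$\beta$$, the OLS slope converges to

$\text{plim}\, \hat{\beta}_{OLS} = \beta + \frac{\text{Cov}(x, \epsilon)}{\text{Var}(x)}$

Positive demand shocks tend to push the price up too, so $$\text{Cov}(x, \epsilon) > 0$$ and OLS is pulled towards zero relative to the (negative) true price elasticity — consistent with the OLS coefficient, -1.056, being less negative than the IV coefficient, -1.279.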