Regression

Regression#

Introduction#

In this chapter, you’ll learn how to run linear regressions with code.

If you’re running this code (either by copying and pasting it, or by downloading it using the icons at the top of the page), you may need to install the packages it uses first. There’s a brief guide to installing packages in the chapter on Preliminaries.

Most of this chapter will rely on pyfixest and we are indebted to Alexander Fischer for his efforts in bringing a convenient and flexible way to absorb high dimensional fixed effects to Python via pyfixest. Some of the material in this chapter follows Grant McDermott’s excellent notes and the Library of Statistical Translation.

Quick Review of Econometrics#

Notation#

Greek letters, like \(\beta\), are the truth and represent parameters. Modified Greek letters are an estimate of the truth, for example \(\hat{\beta}\). Sometimes Greek letters will stand in for vectors of parameters. Most of the time, upper case Latin characters such as \(X\) will represent random variables (which could have more than one dimension). Lower case letters from the Latin alphabet denote realised data, for instance \(x\) (which again could be multi-dimensional). Modified Latin alphabet letters denote computations performed on data, for instance \(\bar{x} = \frac{1}{n} \displaystyle\sum_{i} x_i\) where \(n\) is number of samples.

Ordinary least squares (OLS) regression can be used to estimate the parameters of certain types of model, most typically models of the form

\[ y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 . \]

This generic model says that the value of an outcome variable \(y\) is a linear function of one or more input predictor variables \(x_i\), where the \(x_i\) could be transforms of original data. But the above equation is a platonic ideal, what we call a data generating process (DGP). OLS allows us to recover estimates of the parameters of the model , i.e. to find \(\hat{\beta_i}\) and to enable us to write an estimated model:

\[ y = \hat{\beta_0} + \hat{\beta_1} \cdot x_1 + \hat{\beta_2} \cdot x_2 + \epsilon . \]

This equation can also be expressed in matrix form as

\[ y = x'\cdot \hat{\beta} + \epsilon \]

where \(x' = (1, x_1, \dots, x_{n})'\) and \(\hat{\beta} = (\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_{n}})\).

Given data \(y_i\) stacked to make a vector \(y\) and \(x_{i}\) stacked to make a matrix \(X\), this can be solved for the coefficients \(\hat{\beta}\) according to

\[ \hat{\beta} = \left(X'X\right)^{-1} X'y . \]

Gauss-Markov Conditions#

To be sure that the estimates of these parameters are the best linear unbiased estimate, a few conditions need to hold: the Gauss-Markov conditions:

\(y\) is a linear function of the \(\beta_i\)
\(y\) and the \(x_i\) are randomly sampled from the population.
There is no perfect multi-collinearity of variables.
\(\mathbb{E}(\epsilon | x_1, \dots, x_n) = 0\) (unconfoundedness)
\(\text{Var}(\epsilon | x_1, \dots, x_n) = \sigma^2\) (homoskedasticity)

(1)-(4) also guarantee that OLS estimates are unbiased and \(\mathbb{E}(\hat{\beta}_i) = \beta_i\).

The classic linear model requires a 6th assumption; that \(\epsilon \thicksim \mathcal{N}(0, \sigma^2)\).

The interpretation of regression coefficients depends on what their units are to begin with, but you can always work it out by differentiating both sides of the model equation with respect to the \(x_i\). For example, for the first model equation above

\[ \frac{\partial y}{\partial x_i} = \beta_i \]

so we get the interpretation that \(\beta_i\) is the rate of change of y with respect to \(x_i\). If \(x_i\) and \(y\) are in levels, this means that a unit increase in \(x_i\) is associated with a \(\beta_i\) units increase in \(y\). If the right-hand side of the model is \(\ln x_i\) then we get

\[ \frac{\partial y}{\partial x_i} = \beta_i \frac{1}{x_i} \]

with some abuse of notation, we can rewrite this as \(\partial y = \beta_i \partial x_i/x_i\), which says that a percent change in \(x_i\) is associated with a \(\beta_i\) unit change in \(y\). With a logged \(y\) variable, it’s a percent change in \(x_i\) that is associated with a percent change in \(y\), or \(\partial y/y = \beta_i \partial x_i/x_i\) (note that both sides of this equation are unitless in this case). Finally, another example that is important in practice is that of log differences, eg \(y = \beta_i (\ln x_i - \ln x_i')\). Again, we will abuse notation and say that this case may be represented as \(\partial y = \beta_i (\partial x_i/x_i - \partial x_i'/x_i')\), i.e. the difference in two percentages, a percentage point change, in \(x_i\) is associated with a \(\beta_i\) unit change in \(y\).

Imports#

Let’s import some of the packages we’ll be using:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pyfixest as pf
from lets_plot import *

LetsPlot.setup_html()

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(
    seed_for_prng
)  # prng=probabilistic random number generator

Regression basics#

There are two ways to run regressions: passing the data directly as objects, and using formulae. For most situations, the formula API is more convenient and so most of what we’ll cover uses this.

For our workhorse regression package, we’ll be using pyfixest, which has impressive speed.

We’ll use the starwars dataset to run a regression of mass on height for star wars characters. This example borrows very heavily from notes by Grant McDermott. First, let’s bring the dataset in:

df = pd.read_csv(
    "https://github.com/aeturrell/coding-for-economists/raw/main/data/starwars.csv",
    index_col=0,
)
# Look at first few rows
df.head()

	name	height	mass	hair_color	eye_color	gender	homeworld	species
0	Luke Skywalker	172.0	77.0	blond	blue	male	Tatooine	Human
1	C-3PO	167.0	75.0	NaN	yellow	NaN	Tatooine	Droid
2	R2-D2	96.0	32.0	NaN	red	NaN	Naboo	Droid
3	Darth Vader	202.0	136.0	none	yellow	male	Tatooine	Human
4	Leia Organa	150.0	49.0	brown	brown	female	Alderaan	Human

Okay, now let’s do a regression using OLS and a formula that says our y-variable is mass and our regressor is height:

results_sw = pf.feols("mass ~ height", data=df)

Well, where are the results!? They’re stored in the object we created. To peek at them we need to call the summary function (and, for easy reading, I’ll print it out too using print)

results_sw.summary()

###

Estimation:  OLS
Dep. var.: mass, Fixed effects: 0
Inference:  iid
Observations:  59

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |     2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|---------:|--------:|
| Intercept     |    -13.810 |      111.155 |    -0.124 |      0.902 | -236.393 | 208.773 |
| height        |      0.639 |        0.626 |     1.020 |      0.312 |   -0.615 |   1.892 |
---
RMSE: 166.502 R2: 0.018 

What we’re seeing here are really several bits of information glued together. To just grab the coefficient results in a tidy format, use

results_sw.tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	-13.810314	111.154526	-0.124244	0.901559	-236.393413	208.772785
height	0.638571	0.626058	1.019986	0.312045	-0.615089	1.892231

You’ll have noticed that we got an intercept, even though we didn’t specify one in the formula. pyfixest adds in an intercept by default because, most of the time, you will want one. To turn it off, add a -1 at the end of the formula command, eg in this case you would call feols("mass ~ height -1", data=df).

The fit we got in the case with the intercept was pretty terrible; a low \(R^2\) and both of our confidence intervals are large and contain zero. What’s going on? If there’s one adage in regression that’s always worth paying attention to, it’s always plot your data. Let’s see what’s going on here (using matplotlib for plotting):

fig, ax = plt.subplots()
ax.scatter(df["height"], df["mass"], s=200, alpha=0.8)
ax.annotate(
    "Jabba the Hutt",
    df.iloc[df["mass"].idxmax()][["height", "mass"]],
    xytext=(0, -50),
    textcoords="offset points",
    arrowprops=dict(
        arrowstyle="fancy",
        color="k",
        connectionstyle="arc3,rad=0.3",
    ),
)
ax.set_ylim(0, None)
ax.set_title("Always plot the data", loc="left")
plt.show()

_images/8124476a1ef26c34675c8cf2ea11c257d8747d7f2c4b926602157c7b1a9c12f8.svg

Oh dear, Jabba’s been on the paddy frogs again, and he’s a bit of different case. When we’re estimating statistical relationships, we have all kinds of choices and should be wary about arbitrary decisions of what to include or exclude in case we fool ourselves about the generality of the relationship we are capturing. Let’s say we knew that we weren’t interested in Hutts though, but only in other species: in that case, it’s fair enough to filter out Jabba and run the regression without this obvious outlier. We’ll exclude any entry that contains the string ‘Jabba’ in the name column:

results_outlier_free = pf.feols(
    "mass ~ height", data=df[~df["name"].str.contains("Jabba")]
)
print(results_outlier_free.summary())

###

Estimation:  OLS
Dep. var.: mass, Fixed effects: 0
Inference:  iid
Observations:  58

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |    -32.541 |       12.561 |    -2.591 |      0.012 | -57.703 |  -7.379 |
| height        |      0.621 |        0.071 |     8.785 |      0.000 |   0.480 |   0.763 |
---
RMSE: 18.804 R2: 0.58 
None

This looks a lot more healthy. Not only is the model explaining a lot more of the data, but the coefficients are now significant.

Standard errors#

You’ll have seen that there’s a column for the standard error of the estimates in the regression table and a message saying that the covariance type of these is ‘nonrobust’. Let’s say that, instead, we want to use Eicker-Huber-White robust standard errors, aka “HC2” standard errors. We can specify to use these up front standard errors up front in the fit method:

pf.feols("mass ~ height", data=df, vcov="HC2").summary()

###

Estimation:  OLS
Dep. var.: mass, Fixed effects: 0
Inference:  HC2
Observations:  59

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |    -13.810 |       23.864 |    -0.579 |      0.565 | -61.596 |  33.976 |
| height        |      0.639 |        0.089 |     7.139 |      0.000 |   0.459 |   0.818 |
---
RMSE: 166.502 R2: 0.018 

Or, alternatively, we can go back to our existing results and get new standard errors “on the fly”:

results_sw.vcov("HC2").summary()

###

Estimation:  OLS
Dep. var.: mass, Fixed effects: 0
Inference:  HC2
Observations:  59

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |    -13.810 |       23.864 |    -0.579 |      0.565 | -61.596 |  33.976 |
| height        |      0.639 |        0.089 |     7.139 |      0.000 |   0.459 |   0.818 |
---
RMSE: 166.502 R2: 0.018 

There are several different types of standard errors available in pyfixest:

“iid”, “HC1”, “HC2”, “HC3”
“hetero”, for heteroskedasticity and autocorrelation consistent standard errors, for which you may want to also use some keyword arguments
clustered standard errors: see below. CRV1 and CRV3 available.

You can find information on all of these here.

For now, let’s look more closely at those last ones: clustered standard errors.

Clustered standard errors#

Often, we know something about the structure of likely errors, namely that they occur in groups. In the below example we use one-way clusters to capture this effect in the errors.

Note that in the below example, we grab a subset of the data for which a set of variables we’re interested in are defined, otherwise the below example would execute with an error because of missing cluster-group values.

xf = df.dropna(subset=["homeworld", "mass", "height", "species"])
pf.feols("mass ~ height", data=xf, vcov={"CRV1": "homeworld"}).summary()

###

Estimation:  OLS
Dep. var.: mass, Fixed effects: 0
Inference:  CRV1
Observations:  55

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |     -8.795 |       29.150 |    -0.302 |      0.765 | -67.859 |  50.268 |
| height        |      0.616 |        0.098 |     6.280 |      0.000 |   0.417 |   0.815 |
---
RMSE: 172.239 R2: 0.014 

We can add two-way clustering of standard errors using the following:

pf.feols("mass ~ height", data=xf, vcov={"CRV1": "homeworld + species"}).summary()

###

Estimation:  OLS
Dep. var.: mass, Fixed effects: 0
Inference:  CRV1
Observations:  55

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |     -8.795 |       30.589 |    -0.288 |      0.776 | -71.357 |  53.766 |
| height        |      0.616 |        0.104 |     5.938 |      0.000 |   0.404 |   0.828 |
---
RMSE: 172.239 R2: 0.014 

As you would generally expect, the addition of clustering has increased the standard errors.

Fixed effects and categorical variables#

Fixed effects are a way of allowing the intercept of a regression model to vary freely across individuals or groups. It is, for example, used to control for any individual-specific attributes that do not vary across time in panel data.

Let’s use the ‘mtcars’ dataset to demonstrate this. We’ll read it in and set the datatypes of some of the columns at the same time.

mpg = pd.read_csv(
    "https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv",
    dtype={"model": str, "mpg": float, "hp": float, "disp": float, "cyl": "category"},
)
mpg.head()

	model	mpg	cyl	disp	hp	drat	wt	qsec	vs	am	gear	carb
0	Mazda RX4	21.0	6	160.0	110.0	3.90	2.620	16.46	0	1	4	4
1	Mazda RX4 Wag	21.0	6	160.0	110.0	3.90	2.875	17.02	0	1	4	4
2	Datsun 710	22.8	4	108.0	93.0	3.85	2.320	18.61	1	1	4	1
3	Hornet 4 Drive	21.4	6	258.0	110.0	3.08	3.215	19.44	1	0	3	1
4	Hornet Sportabout	18.7	8	360.0	175.0	3.15	3.440	17.02	0	0	3	2

Now we have our data in we want to regress mpg (miles per gallon) on hp (horsepower) with fixed effects for cyl (cylinders). Now we could just pop in a formula like this 'mpg ~ hp + cyl' because we took the trouble to declare that cyl was of datatype category when reading it in from the csv file. This means that statsmodels will treat it as a category and use it as a fixed effect by default.

But when I read that formula I get nervous that cyl might not have been processed correctly (ie it could have been read in as a float, which is what it looks like) and it might just be treated as a float (aka a continuous variable) in the regression. Which is not what we want at all. So, to be safe, and make our intentions explicit (even when the data is of type ‘category’), it’s best to use the syntax C(cyl) to ask for a fixed effect.

Here’s a regression which does that:

pf.feols("mpg ~ hp + C(cyl)", data=mpg).summary()

###

Estimation:  OLS
Dep. var.: mpg, Fixed effects: 0
Inference:  iid
Observations:  32

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |    2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept     |     28.650 |        1.588 |    18.044 |      0.000 |  25.398 |  31.903 |
| hp            |     -0.024 |        0.015 |    -1.560 |      0.130 |  -0.056 |   0.008 |
| C(cyl)[T.6]   |     -5.968 |        1.639 |    -3.640 |      0.001 |  -9.326 |  -2.610 |
| C(cyl)[T.8]   |     -8.521 |        2.326 |    -3.663 |      0.001 | -13.286 |  -3.756 |
---
RMSE: 2.943 R2: 0.754 

We can see here that two of the three possible values of cyl:

mpg["cyl"].unique()

['6', '4', '8']
Categories (3, object): ['4', '6', '8']

have been added as fixed effects regressors. The way that +C(cyl) has been added makes it so that the coefficients given are relative to the coefficient for the intercept. We can turn the intercept off to get a coefficient per unique cyl value:

pf.feols("mpg ~ hp + C(cyl) -1", data=mpg).tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	28.650118	1.587787	18.044056	0.000000	25.397684	31.902552
hp	-0.024039	0.015408	-1.560163	0.129954	-0.055600	0.007523
C(cyl)[T.6]	-5.967655	1.639278	-3.640418	0.001092	-9.325563	-2.609747
C(cyl)[T.8]	-8.520851	2.326075	-3.663188	0.001029	-13.285599	-3.756102

When there is an intercept, the coefficients of fixed effect variables can be interpreted as being the average of \(y\) for that class compared to the excluded classes holding all other categories and variables fixed.

High dimensional fixed effects, aka absorbing regression#

Sometimes, you just have a LOT of fixed effects (and perhaps you don’t particularly care about them individually). A common example is having a large number of firms as part of a panel. Fortunately, there are ways to make regression with high dimensional fixed effects be both fast and concise. (In Stata, this is provided by the reghdfe package.)

Let’s say we have a regression of the form

\[ y_i = x_i\cdot \beta + z_i\cdot \gamma +\epsilon_i \]

where \(y_i\) are observations indexed by \(i\), \(x_i\) are vectors of exogenous variables we care about the coefficients (\(\beta\)), \(z_i\) are vectors of fixed effects we don’t care too much about the coefficients (\(\gamma\)) for, and the \(\epsilon_i\) are errors. Then we can use an absorbing regression to solve for the \(\beta\) while ignoring the \(\gamma\).

Here’s an example using simulated data on workers taken from the docs for a different regression package (linearmodels). Let’s simulate some data first, with two fixed effects (state and firm) alongside the two exogenous variables we’re interested in.

from numpy.random import default_rng

rng = default_rng()  # Random number generator

# Create synthetic input data
nobs = 1_000_000  # No. observations
state_id = rng.integers(50, size=nobs)  # State identifier
firm_id = rng.integers(nobs // 5, size=nobs)  # Firm identifier (mean of 5 workers/firm)
x = rng.standard_normal((nobs, 2))  # Exogenous variables
sim = pd.DataFrame(
    {
        "state_id": pd.Categorical(state_id),
        "firm_id": pd.Categorical(firm_id),
        "exog_0": x[:, 0],
        "exog_1": x[:, 1],
    }
)

# Create synthetic relationship
beta = [1, 3]  # coefficients of interest
state_effects = rng.standard_normal(state_id.max() + 1)
state_effects = state_effects[state_id]  # Generate state fixed effects
firm_effects = rng.standard_normal(firm_id.max() + 1)
firm_effects = firm_effects[firm_id]  # Generate firm fixed effects
eps = rng.standard_normal(nobs)  # Generate errors
# Generate endogeneous outcome variable
sim["y"] = (
    sim["exog_0"] * beta[0]
    + sim["exog_1"] * beta[1]
    + firm_effects
    + state_effects
    + eps
)
sim.head()

	state_id	firm_id	exog_0	exog_1	y
0	44	53258	-0.857676	-0.214517	-0.735448
1	19	46892	1.566748	0.475191	3.549303
2	31	120287	-1.137149	-0.756408	-3.271708
3	37	3900	-0.525675	1.160383	2.367584
4	14	199504	-0.453512	-0.507235	0.420276

Now we pass this to pyfixest and with the state_id and firm_id variables entered after a vertical bar:

results_hdfe = pf.feols("y ~ exog_0 + exog_1 | state_id + firm_id", data=sim)
results_hdfe.summary()

###

Estimation:  OLS
Dep. var.: y, Fixed effects: state_id+firm_id
Inference:  CRV1
Observations:  1000000

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| exog_0        |      0.998 |        0.001 |   910.841 |      0.000 |  0.996 |   1.001 |
| exog_1        |      3.000 |        0.001 |  3553.902 |      0.000 |  2.998 |   3.002 |
---
RMSE: 0.895 R2: 0.938 R2 Within: 0.909 

So, from our 1,000,000 observations, we have roughly 200,000 fixed effects that have been scooped up and packed away, leaving us with just the coefficients, \(\beta\), on the exogenous variables of interest.

Transformations of regressors#

This chapter is showcasing linear regression. What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can’t be some kind of (potentially non-linear) transform of the original features \(x_i\).

Logs and arcsinh#

You have two options for adding in logs: do them before, or do them in the formula. Doing them before just makes use of standard dataframe operations to declare a new column:

mpg["lnhp"] = np.log(mpg["hp"])
pf.feols("mpg ~ lnhp", data=mpg).tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	72.640475	6.004325	12.098024	4.551914e-13	60.378007	84.902943
lnhp	-10.764176	1.224304	-8.792079	8.387409e-10	-13.264538	-8.263814

Alternatively, you can specify the log directly in the formula:

results_ln = pf.feols("mpg ~ np.log(hp)", data=mpg)
results_ln.tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	72.640475	6.004325	12.098024	4.551914e-13	60.378007	84.902943
np.log(hp)	-10.764176	1.224304	-8.792079	8.387409e-10	-13.264538	-8.263814

Clearly, the first method will work for arcsinh(x) and log(x+1), but you can also pass both of these into the formula directly too. (For more on the pros and cons of arcsinh, see Bellemare and Wichman [2020].) Here it is with arcsinh:

pf.feols("mpg ~ np.arcsinh(hp)", data=mpg).tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	80.104142	6.849916	11.694179	1.060707e-12	66.114748	94.093536
np.arcsinh(hp)	-10.764584	1.224363	-8.791985	8.389380e-10	-13.265067	-8.264100

Interaction terms and powers#

This chapter is showcasing linear regression. What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can’t be some kind of non-linear transform of the original features \(x_i\). Two of the most common transformations that you might want to use are interaction terms and polynomial terms. An example of an interaction term would be

\[ y = \beta_0 + \beta_1 x_1 \cdot x_2 \]

while an example of a polynomial term would be

\[ y = \beta_0 + \beta_1 x_1^2 \]

i.e. the last term enters only after it is multiplied by itself.

One note of warning: the interpretation of the effect of a variable is no longer as simple as was set out at the start of this chapter. To work out what the new interpretation is, the procedure is the same though: just take the derivative. In the case of the interaction model above, the effect of a unit change in \(x_1\) on \(y\) is now going to be a function of \(x_2\). In the case of the polynomial model above, the effect of a unit change in \(x_1\) on \(y\) will be \(2\beta_1 \cdot x_1\). For more on interaction terms, see Balli and Sørensen [2013].

Alright, with all of that preamble out of the way, let’s see how we actual do some of this! Let’s try including a linear and squared term in the regression of mpg on hp making use of the numpy power function:

res_poly = pf.feols("mpg ~ hp + np.power(hp, 2)", data=mpg)
res_poly.tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	40.409117	2.740759	14.743766	5.107026e-15	34.803635	46.014600
hp	-0.213308	0.034884	-6.114812	1.162972e-06	-0.284654	-0.141963
np.power(hp, 2)	0.000421	0.000098	4.274647	1.889240e-04	0.000219	0.000622

Now let’s include the original term in hp, a term in disp, and the interaction between them, which is represented by hp:disp in the table. We can do this by multiplying out the terms:

res_inter = pf.feols("mpg ~ hp * disp", data=mpg)
res_inter.tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	39.674263	2.914172	13.614248	7.172041e-14	33.704852	45.643674
hp	-0.097894	0.024745	-3.956175	4.725449e-04	-0.148581	-0.047207
disp	-0.073373	0.014387	-5.100113	2.109284e-05	-0.102843	-0.043904
hp:disp	0.000290	0.000087	3.336151	2.407037e-03	0.000112	0.000468

In the unusual case that you want only the interaction term, you write it as it appears in the table above:

res_only_inter = pf.feols("mpg ~ hp : disp", data=mpg)
res_only_inter.tidy()

	Estimate	Std. Error	t value	Pr(>\|t\|)	2.5%	97.5%
Coefficient
Intercept	25.809854	1.005184	25.676738	0.000000e+00	23.756994	27.862715
hp:disp	-0.000142	0.000019	-7.408952	2.958491e-08	-0.000181	-0.000103

The formula API explained#

As you will have seen ~ separates the left- and right-hand sides of the regression. + computes a set union, which will also be familiar from the examples above (ie it includes two terms as long as they are distinct). - computes a set difference; it adds the set of terms to the left of it while removing any that appear on the right of it. As we’ve seen, a*b is a short-hand for a + b + a:b, with the last term representing the interaction. / is short hand for a + a:b, which is useful if, for example b is nested within a, so it doesn’t make sense to control for b on its own. Actually, the : character can interact multiple terms so that (a + b):(d + c) is the same as a:c + a:d + b:c + b:d. C(a) tells statsmodels to treat a as a categorical variable that will be included as a fixed effect. Finally, as we saw above with powers, you can also pass in vectorised functions, such as np.log() and np.power(), directly into the formulae.

One gotcha with the formula API is ensuring that you have sensible variable names in your dataframe, i.e. ones that do not include whitespace or, to take a really pathological example, have the name ‘a + b’ for one of the columns that you want to regress on. You can dodge this kind of problem by passing in the variable name as, for example, Q("a + b") to be clear that the column name is anything within the Q("...").

Multiple regression models#

As is so often the case, you’re likely to want to run more than one model at once with different specifications. There are two ways you may wish to do this.

Comparing a set of models you’ve already created#

In the code above, we created three different models: results_ln, res_poly, and res_inter. Let’s say we now want to compare them in a table: we can! By using pyfixest’s etable() function. (If you need to do this in statsmodels, use summary_col() as explained here or stargazer.)

Let’s see etable in action:

pf.etable([results_ln, res_poly, res_inter], type="df")

	est1	est2	est3
depvar	mpg	mpg	mpg
np.log(hp)	-10.764*** \n (1.224)
hp		-0.213*** \n (0.035)	-0.098*** \n (0.025)
np.power(hp, 2)		0.000*** \n (0.000)
disp			-0.073*** \n (0.014)
hp:disp			0.000** \n (0.000)
Intercept	72.640*** \n (6.004)	40.409*** \n (2.741)	39.674*** \n (2.914)
Observations	32	32	32
S.E. type	iid	iid	iid
R2	0.720	0.756	0.820

There are a few different type= options, including dataframe, markdown, and latex:

pf.etable([results_ln, res_poly, res_inter], type="md")

index                   est1        est2        est3
---------------  -----------  ----------  ----------
depvar                   mpg         mpg         mpg
----------------------------------------------------
np.log(hp)       -10.764***
                     (1.224)
hp                            -0.213***   -0.098***
                                 (0.035)     (0.025)
np.power(hp, 2)                0.000***
                                 (0.000)
disp                                      -0.073***
                                             (0.014)
hp:disp                                     0.000**
                                             (0.000)
Intercept         72.640***   40.409***   39.674***
                     (6.004)     (2.741)     (2.914)
----------------------------------------------------
Observations              32          32          32
S.E. type                iid         iid         iid
R2                     0.720       0.756       0.820
----------------------------------------------------

And of course the latex one (type="tex") can be written to a file using open('regression.tex', 'w').write(...) where the ellipsis is the relevant etable() command.

Stepwise multiple models#

Sometimes you want to use a formula and add variables in step-wise. There are 4 stepwise functions: sw, sw0, csw, csw0.

Assume you have the following formula: y ~ x1 + sw(x2, x3). The stepwise function sw will estimate the following two models: y ~ x1 + x2 and y ~ x1 + x3. That is, each element in sw() is sequentially, and separately, added to the formula. If you use sw0 in lieu of sw, then the model y ~ x1 would also be estimated: the 0 in the name implies that the model without any stepwise element will also be estimated. This explains how to use sw and sw0.

The prefix c in the remaining two options, csw and csw0, means “cumulative”: each stepwise element is added to the next. That is, y ~ x1 + csw(x2, x3) would cause y ~ x1 + x2 and y ~ x1 + x2 + x3 to be estimated. The 0 has the same meaning, ie it estimates the most basic version. As an example, y ~ x1 + csw0(x2, x3) leads to the following three models being estimated: y ~ x1, y ~ x1 + x2 and y ~ x1 + x2 + x3.

Let’s see some examples (courtesy of the brilliant fixest documentation for the statistical language R). We’ll be using the iris dataset, which comes with the plotly package.

import plotly.express as px

iris = px.data.iris()
iris.columns = ["y", "x1", "x2", "x3", "species", "id"]
iris.head()

	y	x1	x2	x3	species	id
0	5.1	3.5	1.4	0.2	setosa	1
1	4.9	3.0	1.4	0.2	setosa	1
2	4.7	3.2	1.3	0.2	setosa	1
3	4.6	3.1	1.5	0.2	setosa	1
4	5.0	3.6	1.4	0.2	setosa	1

Let’s try a regular (not cumulative) stepwise regression:

pf.feols("y ~ sw(x1, x2, x3)", data=iris).summary()

###

Estimation:  OLS
Dep. var.: y, Fixed effects: 0
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      6.481 |        0.481 |    13.466 |      0.000 |  5.530 |   7.432 |
| x1            |     -0.209 |        0.156 |    -1.339 |      0.183 | -0.517 |   0.099 |
---
RMSE: 0.82 R2: 0.012 
###

Estimation:  OLS
Dep. var.: y, Fixed effects: 0
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      4.306 |        0.078 |    54.895 |      0.000 |  4.151 |   4.461 |
| x2            |      0.409 |        0.019 |    21.646 |      0.000 |  0.372 |   0.446 |
---
RMSE: 0.404 R2: 0.76 
###

Estimation:  OLS
Dep. var.: y, Fixed effects: 0
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      4.779 |        0.073 |    65.616 |      0.000 |  4.636 |   4.923 |
| x3            |      0.888 |        0.051 |    17.297 |      0.000 |  0.786 |   0.989 |
---
RMSE: 0.475 R2: 0.669 

And a cumulative stepwise regression:

pf.feols("y ~ csw(x1, x2, x3)", data=iris).summary()

###

Estimation:  OLS
Dep. var.: y, Fixed effects: 0
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      6.481 |        0.481 |    13.466 |      0.000 |  5.530 |   7.432 |
| x1            |     -0.209 |        0.156 |    -1.339 |      0.183 | -0.517 |   0.099 |
---
RMSE: 0.82 R2: 0.012 
###

Estimation:  OLS
Dep. var.: y, Fixed effects: 0
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      2.251 |        0.247 |     9.104 |      0.000 |  1.763 |   2.740 |
| x1            |      0.597 |        0.069 |     8.602 |      0.000 |  0.460 |   0.734 |
| x2            |      0.471 |        0.017 |    27.616 |      0.000 |  0.437 |   0.504 |
---
RMSE: 0.33 R2: 0.84 
###

Estimation:  OLS
Dep. var.: y, Fixed effects: 0
Inference:  iid
Observations:  150

| Coefficient   |   Estimate |   Std. Error |   t value |   Pr(>|t|) |   2.5% |   97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept     |      1.845 |        0.250 |     7.368 |      0.000 |  1.350 |   2.340 |
| x1            |      0.655 |        0.067 |     9.823 |      0.000 |  0.523 |   0.787 |
| x2            |      0.711 |        0.057 |    12.560 |      0.000 |  0.599 |   0.823 |
| x3            |     -0.563 |        0.127 |    -4.426 |      0.000 | -0.814 |  -0.311 |
---
RMSE: 0.31 R2: 0.859 

This is a very verbose way to get regression coefficients out! We can get a summary of the info by using etable again, but this time as a method rather than a stand-alone function:

reg_iris_csw = pf.feols("y ~ csw(x1, x2, x3)", data=iris)
reg_iris_csw.etable()

	y
	(1)	(2)	(3)
coef
x1	-0.209 (0.156)	0.597*** (0.069)	0.655*** (0.067)
x2		0.471*** (0.017)	0.711*** (0.057)
x3			-0.563*** (0.127)
Intercept	6.481*** (0.481)	2.251*** (0.247)	1.845*** (0.250)
stats
Observations	150	150	150
S.E. type	iid	iid	iid
R²	0.012	0.840	0.859
Significance levels: * p < 0.05, p < 0.01, * p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

Even handier, pyfixest lets you visualise these results (via Lets-Plot):

reg_iris_csw.coefplot().show()

Instrumental variables#

We will again be using pyfixest for instrumental variables (IV) regression. This sub-section is indebted to another package, linearmodels, for the following example.

Recall that a good instrumental variable \(z\) has zero covariance with the error from the regression (which is untestable) and non-zero covariance with the variable of interest (which is).

Recall that in IV regression, we have a model of the form

\[\begin{split} \begin{split}y_i & = x_{1i}\hat{\beta}_1 + x_{2i}\hat{\beta}_2 + \epsilon_i \\ x_{2i} & = z_{1i}\hat{\delta} + z_{2i}\hat{\gamma} + \nu_i\end{split} \end{split}\]

where \(x_{1i}\) is a set of \(k_1\) exogenous regressors and \(x_{2i}\) is a set of \(k_2\) endogenous regressors such that \(\text{Cov}(x_{2i}, \epsilon_i)\neq 0\). This is a problem for the usual OLS assumptions (the right-hand side should be exogenous).

To get around this, in 2-stage least squares IV, we first regress \(x_{2i}\) on instruments that explain \(x_{2i}\) but not \(y_i\), and then regress \(y_i\) only on the predicted/estimated left-hand side from the first regression, ie on \(\hat{x}_{2i}\).

It’s always easiest to see an example, so let’s estimate what might cause (realised) cigarette demand for the 48 continental US states in 1995.

dfiv = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv",
    dtype={"state": "category", "year": "category"},
).assign(
    rprice=lambda x: x["price"] / x["cpi"],
    rincome=lambda x: x["income"] / x["population"] / x["cpi"],
)
dfiv.head()

	rownames	state	year	cpi	population	packs	income	tax	price	taxs	rprice	rincome
0	1	AL	1985	1.076	3973000	116.486282	46014968	32.500004	102.181671	33.348335	94.964381	10.763866
1	2	AR	1985	1.076	2327000	128.534592	26210736	37.000000	101.474998	37.000000	94.307622	10.468165
2	3	AZ	1985	1.076	3184000	104.522614	43956936	31.000000	108.578751	36.170418	100.909622	12.830456
3	4	CA	1985	1.076	26444000	100.363037	447102816	26.000000	107.837341	32.104000	100.220580	15.713321
4	5	CO	1985	1.076	3209000	112.963539	49466672	31.000000	94.266663	31.000000	87.608425	14.326190

Now we’ll specify the model. It’s going to be in the form dep ~ exog + [endog ~ instruments], where endog will be regressed on instruments and dep will be regressed on both exog and the predicted values of endog.

In this case, the model will be

\[ \text{Price}_i = \hat{\pi}_0 + \hat{\pi}_1 \text{SalesTax}_i + v_i \]

in the first stage regression and

\[ \text{Packs}_i = \hat{\beta}_0 + \hat{\beta}_2\widehat{\text{Price}_i} + \hat{\beta}_1 \text{RealIncome}_i + u_i \]

in the second stage.

results_iv = pf.feols(
    "np.log(packs) ~ np.log(rincome) + 1 | year + state | np.log(rprice) ~ taxs ",
    data=dfiv,
    vcov={"CRV1": "year"},
)
results_iv.summary()

###

Estimation:  IV
Dep. var.: np.log(packs), Fixed effects: year+state
Inference:  CRV1
Observations:  96

| Coefficient     |   Estimate |   Std. Error |               t value |   Pr(>|t|) |   2.5% |   97.5% |
|:----------------|-----------:|-------------:|----------------------:|-----------:|-------:|--------:|
| np.log(rprice)  |     -1.279 |        0.000 | -4898551742658678.000 |      0.000 | -1.279 |  -1.279 |
| np.log(rincome) |      0.443 |        0.000 |  3372420275527436.500 |      0.000 |  0.443 |   0.443 |
---

We can compare the IV results against (naive) OLS. First, run the OLS equivalent:

results_cigs_ols = pf.feols(
    "np.log(packs) ~ np.log(rprice) + np.log(rincome) | year + state",
    data=dfiv,
    vcov={"CRV1": "year"},
)
results_cigs_ols.summary()

###

Estimation:  OLS
Dep. var.: np.log(packs), Fixed effects: year+state
Inference:  CRV1
Observations:  96

| Coefficient     |   Estimate |   Std. Error |               t value |   Pr(>|t|) |   2.5% |   97.5% |
|:----------------|-----------:|-------------:|----------------------:|-----------:|-------:|--------:|
| np.log(rprice)  |     -1.056 |        0.000 | -6310374603234333.000 |      0.000 | -1.056 |  -1.056 |
| np.log(rincome) |      0.497 |        0.000 |  2808549167977084.000 |      0.000 |  0.497 |   0.497 |
---
RMSE: 0.044 R2: 0.967 R2 Within: 0.556 

Let’s do a proper comparison of these models using etable(). Note the order: the IV results are the second column of data.

pf.etable([results_cigs_ols, results_iv], type="md")

index                     est1           est2
---------------  -------------  -------------
depvar           np.log(packs)  np.log(packs)
---------------------------------------------
np.log(rprice)      -1.056***      -1.279***
                       (0.000)        (0.000)
np.log(rincome)      0.497***       0.443***
                       (0.000)        (0.000)
---------------------------------------------
state                        x              x
year                         x              x
---------------------------------------------
Observations                96             96
S.E. type             by: year       by: year
R2                       0.967              -
---------------------------------------------

Once we take into account the fact that the real price is endogenous to (realised) demand, we find that its coefficient is more negative; i.e. an increase in the real price of cigarettes creates a bigger fall in number of packs bought.