Regression#
Introduction#
In this chapter, you’ll learn how to run linear regressions with code.
If you’re running this code (either by copying and pasting it, or by downloading it using the icons at the top of the page), you may need to install the packages it uses first. There’s a brief guide to installing packages in the chapter on Preliminaries.
Most of this chapter will rely on pyfixest and we are indebted to Alexander Fischer for his efforts in bringing a convenient and flexible way to absorb high dimensional fixed effects to Python via pyfixest. Some of the material in this chapter follows Grant McDermott’s excellent notes and the Library of Statistical Translation.
Quick Review of Econometrics#
Notation#
Greek letters, like \(\beta\), are the truth and represent parameters. Modified Greek letters are an estimate of the truth, for example \(\hat{\beta}\). Sometimes Greek letters will stand in for vectors of parameters. Most of the time, upper case Latin characters such as \(X\) will represent random variables (which could have more than one dimension). Lower case letters from the Latin alphabet denote realised data, for instance \(x\) (which again could be multi-dimensional). Modified Latin alphabet letters denote computations performed on data, for instance \(\bar{x} = \frac{1}{n} \displaystyle\sum_{i} x_i\) where \(n\) is the number of samples.
Ordinary least squares (OLS) regression can be used to estimate the parameters of certain types of model, most typically models of the form

$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon
$$

This generic model says that the value of an outcome variable \(y\) is a linear function of one or more input predictor variables \(x_i\), where the \(x_i\) could be transforms of original data. But the above equation is a platonic ideal, what we call a data generating process (DGP). OLS allows us to recover estimates of the parameters of the model, i.e. to find \(\hat{\beta_i}\) and to enable us to write an estimated model:

$$
\hat{y} = \hat{\beta_0} + \hat{\beta_1} x_1 + \hat{\beta_2} x_2 + \dots + \hat{\beta_n} x_n
$$
This equation can also be expressed in matrix form as

$$
\hat{y} = x'\cdot\hat{\beta}
$$
where \(x' = (1, x_1, \dots, x_{n})'\) and \(\hat{\beta} = (\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_{n}})\).
Given data \(y_i\) stacked to make a vector \(y\) and \(x_{i}\) stacked to make a matrix \(X\), this can be solved for the coefficients \(\hat{\beta}\) according to

$$
\hat{\beta} = \left(X'X\right)^{-1} X'y
$$
Gauss-Markov Conditions#
To be sure that the estimates of these parameters are the best linear unbiased estimates, a few conditions, known as the Gauss-Markov conditions, need to hold:
\(y\) is a linear function of the \(\beta_i\)
\(y\) and the \(x_i\) are randomly sampled from the population.
There is no perfect multi-collinearity of variables.
\(\mathbb{E}(\epsilon | x_1, \dots, x_n) = 0\) (unconfoundedness)
\(\text{Var}(\epsilon | x_1, \dots, x_n) = \sigma^2\) (homoskedasticity)
(1)-(4) also guarantee that OLS estimates are unbiased and \(\mathbb{E}(\hat{\beta}_i) = \beta_i\).
The classic linear model requires a 6th assumption; that \(\epsilon \thicksim \mathcal{N}(0, \sigma^2)\).
The interpretation of regression coefficients depends on what their units are to begin with, but you can always work it out by differentiating both sides of the model equation with respect to the \(x_i\). For example, for the first model equation above,

$$
\frac{\partial y}{\partial x_i} = \beta_i
$$

so we get the interpretation that \(\beta_i\) is the rate of change of \(y\) with respect to \(x_i\). If \(x_i\) and \(y\) are in levels, this means that a unit increase in \(x_i\) is associated with a \(\beta_i\)-unit increase in \(y\). If the right-hand side of the model is \(\ln x_i\) then we get

$$
\frac{\partial y}{\partial x_i} = \frac{\beta_i}{x_i}
$$
with some abuse of notation, we can rewrite this as \(\partial y = \beta_i \partial x_i/x_i\), which says that a percent change in \(x_i\) is associated with a \(\beta_i\) unit change in \(y\). With a logged \(y\) variable, it’s a percent change in \(x_i\) that is associated with a percent change in \(y\), or \(\partial y/y = \beta_i \partial x_i/x_i\) (note that both sides of this equation are unitless in this case). Finally, another example that is important in practice is that of log differences, eg \(y = \beta_i (\ln x_i - \ln x_i')\). Again, we will abuse notation and say that this case may be represented as \(\partial y = \beta_i (\partial x_i/x_i - \partial x_i'/x_i')\), i.e. the difference in two percentages, a percentage point change, in \(x_i\) is associated with a \(\beta_i\) unit change in \(y\).
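As a concrete numerical illustration of the log-log case (a made-up example, not from the original text): if the estimated relationship is \(\ln y = 2 + 0.3 \ln x\), then \(\partial y / y = 0.3\, \partial x / x\), so a 1% increase in \(x\) is associated with a 0.3% increase in \(y\).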
Imports#
Let’s import some of the packages we’ll be using:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lets_plot import *
import statsmodels.api as sm
import statsmodels.formula.api as smf
import pyfixest as pf
LetsPlot.setup_html()
# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator
Regression basics#
There are two ways to run regressions: passing the data directly as objects, and using formulae. For most situations, the formula API is more convenient and so most of what we’ll cover uses this.
For our workhorse regression package, we’ll be using pyfixest, which has impressive speed.
We’ll use the starwars dataset to run a regression of mass on height for Star Wars characters. This example borrows very heavily from notes by Grant McDermott. First, let’s bring the dataset in:
df = pd.read_csv(
"https://github.com/aeturrell/coding-for-economists/raw/main/data/starwars.csv",
index_col=0,
)
# Look at first few rows
df.head()
|    | name | height | mass | hair_color | eye_color | gender | homeworld | species |
|---|---|---|---|---|---|---|---|---|
0 | Luke Skywalker | 172.0 | 77.0 | blond | blue | male | Tatooine | Human |
1 | C-3PO | 167.0 | 75.0 | NaN | yellow | NaN | Tatooine | Droid |
2 | R2-D2 | 96.0 | 32.0 | NaN | red | NaN | Naboo | Droid |
3 | Darth Vader | 202.0 | 136.0 | none | yellow | male | Tatooine | Human |
4 | Leia Organa | 150.0 | 49.0 | brown | brown | female | Alderaan | Human |
Okay, now let’s do a regression using OLS and a formula that says our y-variable is mass and our regressor is height:
results_sw = pf.feols("mass ~ height", data=df)
Well, where are the results!? They’re stored in the object we created. To peek at them, we need to call the summary() method:
results_sw.summary()
###
Estimation: OLS
Dep. var.: mass
Inference: iid
Observations: 59
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|---------:|--------:|
| Intercept | -13.810 | 111.155 | -0.124 | 0.902 | -236.393 | 208.773 |
| height | 0.639 | 0.626 | 1.020 | 0.312 | -0.615 | 1.892 |
---
RMSE: 166.502 R2: 0.018
What we’re seeing here are really several bits of information glued together. To just grab the coefficient results in a tidy format, use
results_sw.tidy()
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
Intercept | -13.810314 | 111.154526 | -0.124244 | 0.901559 | -236.393412 | 208.772785 |
height | 0.638571 | 0.626058 | 1.019986 | 0.312045 | -0.615089 | 1.892231 |
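To connect this back to the normal equations at the start of the chapter, here’s a quick check (a sketch, not in the original text) that computing \(\hat{\beta} = (X'X)^{-1}X'y\) by hand on the same data reproduces the feols() coefficients. The dropna() call mirrors the listwise deletion that the regression performs.

# Solve the normal equations directly and compare with the feols() estimates above
sub = df.dropna(subset=["mass", "height"])
X = np.column_stack([np.ones(len(sub)), sub["height"].to_numpy()])
y = sub["mass"].to_numpy()
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # approximately [-13.81, 0.64]: the Intercept and height estimates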
You’ll have noticed that we got an intercept, even though we didn’t specify one in the formula. pyfixest adds in an intercept by default because, most of the time, you will want one. To turn it off, add a -1
at the end of the formula command, eg in this case you would call feols("mass ~ height -1", data=df)
.
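For example (output not shown in the original text), the no-intercept version of the regression above and its tidied coefficients would be:

# Dropping the intercept with -1; note that the interpretation of the slope changes
pf.feols("mass ~ height -1", data=df).tidy()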
The fit we got in the case with the intercept was pretty terrible: a low \(R^2\), and both of our confidence intervals are wide and contain zero. What’s going on? If there’s one adage in regression that’s always worth paying attention to, it’s always plot your data. Let’s see what’s going on here (using matplotlib for plotting):
fig, ax = plt.subplots()
ax.scatter(df["height"], df["mass"], s=200, alpha=0.8)
ax.annotate(
"Jabba the Hutt",
df.iloc[df["mass"].idxmax()][["height", "mass"]],
xytext=(0, -50),
textcoords="offset points",
arrowprops=dict(
arrowstyle="fancy",
color="k",
connectionstyle="arc3,rad=0.3",
),
)
ax.set_ylim(0, None)
ax.set_title("Always plot the data", loc="left")
plt.show()
Oh dear, Jabba’s been on the paddy frogs again, and he’s a bit of a different case. When we’re estimating statistical relationships, we have all kinds of choices and should be wary of arbitrary decisions about what to include or exclude, in case we fool ourselves about the generality of the relationship we are capturing. Let’s say we knew that we weren’t interested in Hutts, though, but only in other species: in that case, it’s fair enough to filter out Jabba and run the regression without this obvious outlier. We’ll exclude any entry that contains the string ‘Jabba’ in the name
column:
results_outlier_free = pf.feols(
"mass ~ height", data=df[~df["name"].str.contains("Jabba")]
)
results_outlier_free.summary()
###
Estimation: OLS
Dep. var.: mass
Inference: iid
Observations: 58
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept | -32.541 | 12.561 | -2.591 | 0.012 | -57.703 | -7.379 |
| height | 0.621 | 0.071 | 8.785 | 0.000 | 0.480 | 0.763 |
---
RMSE: 18.804 R2: 0.58
This looks a lot more healthy. Not only is the model explaining a lot more of the data, but the coefficients are now significant.
Standard errors#
You’ll have seen that there’s a column for the standard error of the estimates in the regression table, and a line reporting that the inference type is ‘iid’. Let’s say that, instead, we want to use Eicker-Huber-White robust standard errors, aka “HC2” standard errors. We can specify these up front via the vcov keyword argument:
pf.feols("mass ~ height", data=df, vcov="HC2").summary()
###
Estimation: OLS
Dep. var.: mass
Inference: HC2
Observations: 59
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept | -13.810 | 23.864 | -0.579 | 0.565 | -61.596 | 33.976 |
| height | 0.639 | 0.089 | 7.139 | 0.000 | 0.459 | 0.818 |
---
RMSE: 166.502 R2: 0.018
Or, alternatively, we can go back to our existing results and get new standard errors “on the fly”:
results_sw.vcov("HC2").summary()
###
Estimation: OLS
Dep. var.: mass
Inference: HC2
Observations: 59
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept | -13.810 | 23.864 | -0.579 | 0.565 | -61.596 | 33.976 |
| height | 0.639 | 0.089 | 7.139 | 0.000 | 0.459 | 0.818 |
---
RMSE: 166.502 R2: 0.018
There are several different types of standard errors available in pyfixest:
“iid”, “HC1”, “HC2”, “HC3”
“hetero”, for heteroskedasticity-robust standard errors
clustered standard errors: see below. CRV1 and CRV3 available.
You can find information on all of these here.
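As a quick illustration (a sketch, not from the original text), you can loop over several of these options and compare the standard error on height in the Star Wars regression:

# Compare the standard error on height across several vcov choices
for vcov_type in ["iid", "HC1", "HC2", "HC3"]:
    se_height = pf.feols("mass ~ height", data=df, vcov=vcov_type).tidy().loc[
        "height", "Std. Error"
    ]
    print(f"{vcov_type}: {se_height:.3f}")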
For now, let’s look more closely at those last ones: clustered standard errors.
Clustered standard errors#
Often, we know something about the structure of likely errors: namely, that they occur in groups. In the example below, we use one-way clustering to capture this structure in the errors.
Note that we first grab the subset of the data for which the variables we’re interested in are all defined; otherwise the regression would fail because of missing cluster-group values.
xf = df.dropna(subset=["homeworld", "mass", "height", "species"])
pf.feols("mass ~ height", data=xf, vcov={"CRV1": "homeworld"}).summary()
###
Estimation: OLS
Dep. var.: mass
Inference: CRV1
Observations: 55
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept | -8.795 | 29.150 | -0.302 | 0.765 | -67.859 | 50.268 |
| height | 0.616 | 0.098 | 6.280 | 0.000 | 0.417 | 0.815 |
---
RMSE: 172.239 R2: 0.014
We can add two-way clustering of standard errors using the following:
pf.feols("mass ~ height", data=xf, vcov={"CRV1": "homeworld + species"}).summary()
###
Estimation: OLS
Dep. var.: mass
Inference: CRV1
Observations: 55
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept | -8.795 | 30.589 | -0.288 | 0.776 | -71.357 | 53.766 |
| height | 0.616 | 0.104 | 5.938 | 0.000 | 0.404 | 0.828 |
---
RMSE: 172.239 R2: 0.014
As you would generally expect, the addition of clustering has increased the standard errors.
Fixed effects and categorical variables#
Fixed effects are a way of allowing the intercept of a regression model to vary freely across individuals or groups. It is, for example, used to control for any individual-specific attributes that do not vary across time in panel data.
Let’s use the ‘mtcars’ dataset to demonstrate this. We’ll read it in and set the datatypes of some of the columns at the same time.
mpg = pd.read_csv(
"https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv",
dtype={"model": str, "mpg": float, "hp": float, "disp": float, "cyl": "category"},
)
mpg.head()
|    | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110.0 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110.0 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93.0 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110.0 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175.0 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Now we have our data in, we want to regress mpg (miles per gallon) on hp (horsepower) with fixed effects for cyl (cylinders). We could just pop in a formula like 'mpg ~ hp + cyl' because we took the trouble to declare that cyl was of datatype category when reading it in from the csv file, which means it will be treated as a categorical variable by default. But when I read that formula I get nervous that cyl might not have been processed correctly (ie it could have been read in as a float, which is what it looks like) and might then be treated as a continuous variable in the regression, which is not what we want at all. So, to be safe, and to make our intentions explicit (even when the data is of type ‘category’), it’s best to use the syntax C(cyl) to ask for a fixed effect.
Here’s a regression which does that:
pf.feols("mpg ~ hp + C(cyl)", data=mpg).summary()
###
Estimation: OLS
Dep. var.: mpg
Inference: iid
Observations: 32
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|--------:|--------:|
| Intercept | 28.650 | 1.588 | 18.044 | 0.000 | 25.398 | 31.903 |
| hp | -0.024 | 0.015 | -1.560 | 0.130 | -0.056 | 0.008 |
| C(cyl)[T.6] | -5.968 | 1.639 | -3.640 | 0.001 | -9.326 | -2.610 |
| C(cyl)[T.8] | -8.521 | 2.326 | -3.663 | 0.001 | -13.286 | -3.756 |
---
RMSE: 2.943 R2: 0.754
We can see here that two of the three possible values of cyl:
mpg["cyl"].unique()
['6', '4', '8']
Categories (3, object): ['4', '6', '8']
have been added as fixed effect regressors. The way that +C(cyl) has been entered means that the coefficients shown are relative to the omitted base category (here, a cyl value of 4), which is absorbed into the intercept. We can turn the intercept off to get a coefficient for each unique cyl value:
pf.feols("mpg ~ hp + C(cyl) -1", data=mpg).tidy()
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
hp | -0.024039 | 0.015408 | -1.560163 | 1.299540e-01 | -0.055600 | 0.007523 |
C(cyl)[T.4] | 28.650118 | 1.587787 | 18.044056 | 0.000000e+00 | 25.397684 | 31.902552 |
C(cyl)[T.6] | 22.682463 | 2.228049 | 10.180415 | 6.476264e-11 | 18.118512 | 27.246414 |
C(cyl)[T.8] | 20.129267 | 3.331419 | 6.042251 | 1.633553e-06 | 13.305166 | 26.953369 |
When there is an intercept, the coefficient on a fixed effect (categorical) variable can be interpreted as the average difference in \(y\) for that class relative to the excluded base class, holding all other categories and variables fixed.
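To see this equivalence numerically, here’s a quick sketch (not in the original text; the coefficient labels are taken from the tidy tables above):

# The gap between the per-class coefficients in the no-intercept model equals the
# relative coefficient reported when an intercept is included
with_const = pf.feols("mpg ~ hp + C(cyl)", data=mpg).tidy()["Estimate"]
no_const = pf.feols("mpg ~ hp + C(cyl) -1", data=mpg).tidy()["Estimate"]
print(no_const["C(cyl)[T.6]"] - no_const["C(cyl)[T.4]"])  # approximately -5.968
print(with_const["C(cyl)[T.6]"])  # the same number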
High dimensional fixed effects, aka absorbing regression#
Sometimes, you just have a LOT of fixed effects (and perhaps you don’t particularly care about them individually). A common example is having a large number of firms as part of a panel. Fortunately, there are ways to make regression with high dimensional fixed effects both fast and concise. (In Stata, this is provided by the reghdfe package.)
Let’s say we have a regression of the form

$$
y_i = x_i\beta + z_i\gamma + \epsilon_i
$$

where \(y_i\) are observations indexed by \(i\), \(x_i\) are vectors of exogenous variables whose coefficients, \(\beta\), we care about, \(z_i\) are vectors of fixed effects whose coefficients, \(\gamma\), we don’t care too much about, and the \(\epsilon_i\) are errors. Then we can use an absorbing regression to solve for the \(\beta\) while ignoring the \(\gamma\).
Here’s an example using simulated data on workers taken from the docs for a different regression package (linearmodels). Let’s simulate some data first, with two fixed effects (state and firm) alongside the two exogenous variables we’re interested in.
from numpy.random import default_rng
rng = default_rng() # Random number generator
# Create synthetic input data
nobs = 1_000_000 # No. observations
state_id = rng.integers(50, size=nobs) # State identifier
firm_id = rng.integers(nobs // 5, size=nobs) # Firm identifier (mean of 5 workers/firm)
x = rng.standard_normal((nobs, 2)) # Exogenous variables
sim = pd.DataFrame(
{
"state_id": pd.Categorical(state_id),
"firm_id": pd.Categorical(firm_id),
"exog_0": x[:, 0],
"exog_1": x[:, 1],
}
)
# Create synthetic relationship
beta = [1, 3] # coefficients of interest
state_effects = rng.standard_normal(state_id.max() + 1)
state_effects = state_effects[state_id] # Generate state fixed effects
firm_effects = rng.standard_normal(firm_id.max() + 1)
firm_effects = firm_effects[firm_id] # Generate firm fixed effects
eps = rng.standard_normal(nobs) # Generate errors
# Generate the endogenous outcome variable
sim["y"] = (
sim["exog_0"] * beta[0]
+ sim["exog_1"] * beta[1]
+ firm_effects
+ state_effects
+ eps
)
sim.head()
|    | state_id | firm_id | exog_0 | exog_1 | y |
|---|---|---|---|---|---|
0 | 1 | 186213 | -0.761561 | 0.340281 | -3.057208 |
1 | 17 | 159758 | -0.680405 | 1.086975 | 0.445564 |
2 | 18 | 104489 | 0.683346 | -0.106550 | -1.247462 |
3 | 5 | 30840 | -0.207784 | 0.169841 | 1.902407 |
4 | 25 | 144311 | 0.295865 | 1.241407 | 3.598971 |
Now we pass this to pyfixest, with the state_id and firm_id variables entered after a vertical bar:
results_hdfe = pf.feols("y ~ exog_0 + exog_1 | state_id + firm_id", data=sim)
results_hdfe.summary()
###
Estimation: OLS
Dep. var.: y, Fixed effects: state_id+firm_id
Inference: CRV1
Observations: 1000000
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| exog_0 | 0.998 | 0.001 | 1035.827 | 0.000 | 0.996 | 1.000 |
| exog_1 | 3.001 | 0.001 | 3007.267 | 0.000 | 2.999 | 3.003 |
---
RMSE: 0.894 R2: 0.937 R2 Within: 0.91
So, from our 1,000,000 observations, we have roughly 200,000 fixed effects that have been scooped up and packed away, leaving us with just the coefficients, \(\beta\), on the exogenous variables of interest.
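To get a feel for what absorbing a fixed effect does under the hood, here’s a minimal sketch (not part of the original text) using the simulated data above. By the Frisch-Waugh-Lovell theorem, demeaning within state_id groups and then running plain OLS reproduces the slope coefficients you would get from feols("y ~ exog_0 + exog_1 | state_id", data=sim); with two sets of fixed effects the demeaning has to be iterated, which is exactly what pyfixest handles for you.

# Within-transform (demean) y and the exogenous variables by state, then run OLS
cols = ["y", "exog_0", "exog_1"]
demeaned = sim[cols] - sim.groupby("state_id", observed=True)[cols].transform("mean")
X_dm = demeaned[["exog_0", "exog_1"]].to_numpy()
y_dm = demeaned["y"].to_numpy()
# No intercept needed: everything has mean zero within each state
beta_within, *_ = np.linalg.lstsq(X_dm, y_dm, rcond=None)
print(beta_within)  # close to the true values of [1, 3]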
Transformations of regressors#
This chapter is showcasing linear regression. What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can’t be some kind of (potentially non-linear) transform of the original features \(x_i\).
Logs and arcsinh#
You have two options for adding in logs: do them before, or do them in the formula. Doing them before just makes use of standard dataframe operations to declare a new column:
mpg["lnhp"] = np.log(mpg["hp"])
pf.feols("mpg ~ lnhp", data=mpg).tidy()
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
Intercept | 72.640475 | 6.004325 | 12.098024 | 4.551914e-13 | 60.378007 | 84.902943 |
lnhp | -10.764176 | 1.224304 | -8.792079 | 8.387409e-10 | -13.264538 | -8.263814 |
Alternatively, you can specify the log directly in the formula:
results_ln = pf.feols("mpg ~ np.log(hp)", data=mpg)
results_ln.tidy()
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
Intercept | 72.640475 | 6.004325 | 12.098024 | 4.551914e-13 | 60.378007 | 84.902943 |
np.log(hp) | -10.764176 | 1.224304 | -8.792079 | 8.387409e-10 | -13.264538 | -8.263814 |
Clearly, the first method will work for arcsinh(x) and log(x+1), but you can also pass both of these into the formula directly. (For more on the pros and cons of arcsinh, see Bellemare and Wichman [2020].) Here it is with arcsinh:
pf.feols("mpg ~ np.arcsinh(hp)", data=mpg).tidy()
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
Intercept | 80.104142 | 6.849916 | 11.694179 | 1.060707e-12 | 66.114748 | 94.093536 |
np.arcsinh(hp) | -10.764584 | 1.224363 | -8.791985 | 8.389380e-10 | -13.265067 | -8.264100 |
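To tie this back to the interpretation rules from earlier in the chapter, here’s a small simulated check (an illustrative sketch; the data and variable names are made up): generate y with a constant elasticity of 2 with respect to x and confirm that a log-log regression recovers it.

# Simulate data with a known elasticity of 2, plus a little multiplicative noise
elas_df = pd.DataFrame({"x": prng.uniform(1, 10, size=1_000)})
elas_df["y"] = elas_df["x"] ** 2 * np.exp(prng.normal(0, 0.05, size=1_000))
# The coefficient on np.log(x) should come out close to 2
pf.feols("np.log(y) ~ np.log(x)", data=elas_df).tidy()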
Interaction terms and powers#
As noted above, the model being linear in the regressors doesn’t stop those regressors being non-linear transforms of the original features \(x_i\). Two of the most common transformations that you might want to use are interaction terms and polynomial terms. An example of an interaction term would be

$$
y = \beta_0 + \beta_1 x_1 \cdot x_2,
$$

while an example of a polynomial term would be

$$
y = \beta_0 + \beta_1 x_1 \cdot x_1,
$$

i.e. the last term enters only after it is multiplied by itself.
One note of warning: the interpretation of the effect of a variable is no longer as simple as was set out at the start of this chapter. To work out what the new interpretation is, the procedure is the same though: just take the derivative. In the case of the interaction model above, the effect of a unit change in \(x_1\) on \(y\) is now going to be a function of \(x_2\). In the case of the polynomial model above, the effect of a unit change in \(x_1\) on \(y\) will be \(2\beta_1 \cdot x_1\). For more on interaction terms, see Balli and Sørensen [2013].
Alright, with all of that preamble out of the way, let’s see how we actually do some of this! Let’s try including a linear and a squared term in the regression of mpg on hp, making use of the numpy power function:
res_poly = pf.feols("mpg ~ hp + np.power(hp, 2)", data=mpg)
res_poly.tidy()
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
Intercept | 40.409117 | 2.740759 | 14.743766 | 5.107026e-15 | 34.803635 | 46.014600 |
hp | -0.213308 | 0.034884 | -6.114812 | 1.162972e-06 | -0.284654 | -0.141963 |
np.power(hp,2) | 0.000421 | 0.000098 | 4.274647 | 1.889240e-04 | 0.000219 | 0.000622 |
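Bearing in mind the warning above about interpretation, here’s a small sketch (not in the original text) of how you might compute the implied marginal effect of hp at a few, arbitrarily chosen, values by pulling the estimates out of the tidy table:

# For the quadratic specification, d(mpg)/d(hp) = b_hp + 2 * b_hp2 * hp
coefs = res_poly.tidy()["Estimate"]
b_hp = coefs["hp"]
b_hp2 = coefs.filter(like="power").iloc[0]  # coefficient on the squared term
for hp_val in [75, 150, 250]:
    print(f"hp={hp_val}: marginal effect = {b_hp + 2 * b_hp2 * hp_val:.3f}")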
Now let’s include the original term in hp, a term in disp, and the interaction between them, which is represented by hp:disp in the table. We can do this by multiplying out the terms:
res_inter = pf.feols("mpg ~ hp * disp", data=mpg)
res_inter.tidy()
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
Intercept | 39.674263 | 2.914172 | 13.614248 | 7.172041e-14 | 33.704852 | 45.643674 |
hp | -0.097894 | 0.024745 | -3.956175 | 4.725449e-04 | -0.148581 | -0.047207 |
disp | -0.073373 | 0.014387 | -5.100113 | 2.109284e-05 | -0.102843 | -0.043904 |
hp:disp | 0.000290 | 0.000087 | 3.336151 | 2.407037e-03 | 0.000112 | 0.000468 |
In the unusual case that you want only the interaction term, you write it as it appears in the table above:
res_only_inter = pf.feols("mpg ~ hp : disp", data=mpg)
res_only_inter.tidy()
| Coefficient | Estimate | Std. Error | t value | Pr(>\|t\|) | 2.5% | 97.5% |
|---|---|---|---|---|---|---|
Intercept | 25.809854 | 1.005184 | 25.676738 | 0.000000e+00 | 23.756994 | 27.862715 |
hp:disp | -0.000142 | 0.000019 | -7.408952 | 2.958491e-08 | -0.000181 | -0.000103 |
The formula API explained#
As you will have seen, ~ separates the left- and right-hand sides of the regression. + computes a set union, which will also be familiar from the examples above (ie it includes two terms as long as they are distinct). - computes a set difference; it adds the set of terms to the left of it while removing any that appear on the right of it. As we’ve seen, a*b is short-hand for a + b + a:b, with the last term representing the interaction. a/b is short-hand for a + a:b, which is useful if, for example, b is nested within a, so it doesn’t make sense to control for b on its own. Actually, the : character can interact multiple terms, so that (a + b):(c + d) is the same as a:c + a:d + b:c + b:d. C(a) tells the formula parser to treat a as a categorical variable that will be included as a fixed effect. Finally, as we saw above with powers, you can also pass vectorised functions, such as np.log() and np.power(), directly into the formulae.
One gotcha with the formula API is ensuring that you have sensible variable names in your dataframe, i.e. ones that do not include whitespace or, to take a really pathological example, have the name ‘a + b’ for one of the columns that you want to regress on. You can dodge this kind of problem by passing in the variable name as, for example, Q("a + b")
to be clear that the column name is anything within the Q("...")
.
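If you’d rather sidestep quoting entirely, another option (a sketch with a made-up dataframe, not from the original text) is simply to rename awkward columns before fitting:

# A hypothetical dataframe with an awkward column name
awkward = pd.DataFrame({"a + b": [1.0, 2.0, 3.0, 4.0], "y": [2.1, 3.9, 6.2, 7.8]})
renamed = awkward.rename(columns={"a + b": "a_plus_b"})
pf.feols("y ~ a_plus_b", data=renamed).tidy()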
Multiple regression models#
As is so often the case, you’re likely to want to run more than one model at once with different specifications. There are two ways you may wish to do this.
Comparing a set of models you’ve already created#
In the code above, we created three different models: results_ln
, res_poly
, and res_inter
. Let’s say we now want to compare them in a table: we can! By using pyfixest’s etable()
function. (If you need to do this in statsmodels, use summary_col()
as explained here or stargazer.)
Let’s see etable
in action:
pf.etable([results_ln, res_poly, res_inter], type="df")
|    |    | est1 | est2 | est3 |
|---|---|---|---|---|
0 | depvar | mpg | mpg | mpg |
1 | Intercept | 72.640*** (6.004) | 40.409*** (2.741) | 39.674*** (2.914) |
2 | np.log(hp) | -10.764*** (1.224) | ||
3 | hp | -0.213*** (0.035) | -0.098*** (0.025) | |
4 | np.power(hp,2) | 0.000*** (0.000) | ||
5 | disp | -0.073*** (0.014) | ||
6 | hp:disp | 0.000** (0.000) | ||
7 | R2 | 0.720 | 0.756 | 0.820 |
8 | S.E. type | iid | iid | iid |
9 | Observations | 32 | 32 | 32 |
There are a few different type=
options, including dataframe, markdown, and latex:
pf.etable([results_ln, res_poly, res_inter], type="md")
est1 est2 est3
-------------- ------------------ ----------------- -----------------
depvar mpg mpg mpg
------------------------------------------------------------------------
Intercept 72.640*** (6.004) 40.409*** (2.741) 39.674*** (2.914)
np.log(hp) -10.764*** (1.224)
hp -0.213*** (0.035) -0.098*** (0.025)
np.power(hp,2) 0.000*** (0.000)
disp -0.073*** (0.014)
hp:disp 0.000** (0.000)
------------------------------------------------------------------------
------------------------------------------------------------------------
R2 0.720 0.756 0.820
S.E. type iid iid iid
Observations 32 32 32
------------------------------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)
And of course the latex one (type="tex"
) can be written to a file using open('regression.tex', 'w').write(...)
where the ellipsis is the relevant etable()
command.
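Putting that together explicitly (a sketch following the text above; the filename is just an example):

# Get the LaTeX version of the comparison table and write it to disk
latex_table = pf.etable([results_ln, res_poly, res_inter], type="tex")
with open("regression.tex", "w") as f:
    f.write(latex_table)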
Stepwise multiple models#
Sometimes you want to use a formula and add variables in step-wise. There are 4 stepwise functions: sw
, sw0
, csw
, csw0
.
Assume you have the following formula: y ~ x1 + sw(x2, x3)
. The stepwise function sw will estimate the following two models: y ~ x1 + x2
and y ~ x1 + x3
. That is, each element in sw()
is sequentially, and separately, added to the formula. If you use sw0
in lieu of sw
, then the model y ~ x1
would also be estimated: the 0
in the name implies that the model without any stepwise element will also be estimated. This explains how to use sw
and sw0
.
The prefix c in the remaining two options, csw
and csw0
, means “cumulative”: each stepwise element is added to the next. That is, y ~ x1 + csw(x2, x3)
would cause y ~ x1 + x2
and y ~ x1 + x2 + x3
to be estimated. The 0
has the same meaning, ie it estimates the most basic version. As an example, y ~ x1 + csw0(x2, x3)
leads to the following three models being estimated: y ~ x1
, y ~ x1 + x2
and y ~ x1 + x2 + x3
.
Let’s see some examples (courtesy of the brilliant fixest documentation for the statistical language R). We’ll be using the iris dataset, which comes with the plotly package.
import plotly.express as px
iris = px.data.iris()
iris.columns = ["y", "x1", "x2", "x3", "species", "id"]
iris.head()
|    | y | x1 | x2 | x3 | species | id |
|---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |
Let’s try a regular (not cumulative) stepwise regression:
pf.feols("y ~ sw(x1, x2, x3)", data=iris).summary()
###
Estimation: OLS
Dep. var.: y
Inference: iid
Observations: 150
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 6.481 | 0.481 | 13.466 | 0.000 | 5.530 | 7.432 |
| x1 | -0.209 | 0.156 | -1.339 | 0.183 | -0.517 | 0.099 |
---
RMSE: 0.82 R2: 0.012
###
Estimation: OLS
Dep. var.: y
Inference: iid
Observations: 150
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 4.306 | 0.078 | 54.895 | 0.000 | 4.151 | 4.461 |
| x2 | 0.409 | 0.019 | 21.646 | 0.000 | 0.372 | 0.446 |
---
RMSE: 0.404 R2: 0.76
###
Estimation: OLS
Dep. var.: y
Inference: iid
Observations: 150
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 4.779 | 0.073 | 65.616 | 0.000 | 4.636 | 4.923 |
| x3 | 0.888 | 0.051 | 17.297 | 0.000 | 0.786 | 0.989 |
---
RMSE: 0.475 R2: 0.669
And a cumulative stepwise regression:
pf.feols("y ~ csw(x1, x2, x3)", data=iris).summary()
###
Estimation: OLS
Dep. var.: y
Inference: iid
Observations: 150
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 6.481 | 0.481 | 13.466 | 0.000 | 5.530 | 7.432 |
| x1 | -0.209 | 0.156 | -1.339 | 0.183 | -0.517 | 0.099 |
---
RMSE: 0.82 R2: 0.012
###
Estimation: OLS
Dep. var.: y
Inference: iid
Observations: 150
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 2.251 | 0.247 | 9.104 | 0.000 | 1.763 | 2.740 |
| x1 | 0.597 | 0.069 | 8.602 | 0.000 | 0.460 | 0.734 |
| x2 | 0.471 | 0.017 | 27.616 | 0.000 | 0.437 | 0.504 |
---
RMSE: 0.33 R2: 0.84
###
Estimation: OLS
Dep. var.: y
Inference: iid
Observations: 150
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:--------------|-----------:|-------------:|----------:|-----------:|-------:|--------:|
| Intercept | 1.845 | 0.250 | 7.368 | 0.000 | 1.350 | 2.340 |
| x1 | 0.655 | 0.067 | 9.823 | 0.000 | 0.523 | 0.787 |
| x2 | 0.711 | 0.057 | 12.560 | 0.000 | 0.599 | 0.823 |
| x3 | -0.563 | 0.127 | -4.426 | 0.000 | -0.814 | -0.311 |
---
RMSE: 0.31 R2: 0.859
This is a very verbose way to get regression coefficients out! We can get a summary of the info by using etable
again, but this time as a method rather than a stand-alone function:
reg_iris_csw = pf.feols("y ~ csw(x1, x2, x3)", data=iris)
reg_iris_csw.etable()
| fml | y~x1 |    | y~x1+x2 |    |    | y~x1+x2+x3 |    |    |    |
|---|---|---|---|---|---|---|---|---|---|
Coefficient | Intercept | x1 | Intercept | x1 | x2 | Intercept | x1 | x2 | x3 |
Estimate | 6.481 | -0.209 | 2.251 | 0.597 | 0.471 | 1.845 | 0.655 | 0.711 | -0.563 |
Std. Error | 0.481 | 0.156 | 0.247 | 0.069 | 0.017 | 0.250 | 0.067 | 0.057 | 0.127 |
t value | 13.466 | -1.339 | 9.104 | 8.602 | 27.616 | 7.368 | 9.823 | 12.560 | -4.426 |
Pr(>\|t\|) | 0.000 | 0.183 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
2.5% | 5.530 | -0.517 | 1.763 | 0.460 | 0.437 | 1.350 | 0.523 | 0.599 | -0.814 |
97.5% | 7.432 | 0.099 | 2.740 | 0.734 | 0.504 | 2.340 | 0.787 | 0.823 | -0.311 |
Even handier, pyfixest lets you visualise these results (via Lets-Plot):
reg_iris_csw.coefplot().show()
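For completeness, the 0 variants work the same way; for example, the call below (output not shown) also estimates the base model y ~ x1:

# csw0 estimates y ~ x1, y ~ x1 + x2, and y ~ x1 + x2 + x3
pf.feols("y ~ x1 + csw0(x2, x3)", data=iris).etable()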
Instrumental variables#
We will again be using pyfixest for instrumental variables (IV) regression. This sub-section is indebted to another package, linearmodels, for the following example.
Recall that a good instrumental variable \(z\) has zero covariance with the error from the regression (which is untestable) and non-zero covariance with the variable of interest (which is testable).
Recall that in IV regression, we have a model of the form

$$
y_i = x_{1i}\beta_1 + x_{2i}\beta_2 + \epsilon_i
$$

where \(x_{1i}\) is a set of \(k_1\) exogenous regressors and \(x_{2i}\) is a set of \(k_2\) endogenous regressors such that \(\text{Cov}(x_{2i}, \epsilon_i)\neq 0\). This is a problem for the usual OLS assumptions (the right-hand side should be exogenous).

To get around this, in 2-stage least squares IV, we first regress \(x_{2i}\) on instruments that explain \(x_{2i}\) but not \(y_i\), and then regress \(y_i\) on the exogenous regressors and on the predicted values from the first stage, \(\hat{x}_{2i}\), rather than on \(x_{2i}\) itself.
It’s always easiest to see an example, so let’s estimate what might cause (realised) cigarette demand for the 48 continental US states, using data from 1985 and 1995.
dfiv = pd.read_csv(
"https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv",
dtype={"state": "category", "year": "category"},
).assign(
rprice=lambda x: x["price"] / x["cpi"],
rincome=lambda x: x["income"] / x["population"] / x["cpi"],
)
dfiv.head()
|    | rownames | state | year | cpi | population | packs | income | tax | price | taxs | rprice | rincome |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | AL | 1985 | 1.076 | 3973000 | 116.486282 | 46014968 | 32.500004 | 102.181671 | 33.348335 | 94.964381 | 10.763866 |
1 | 2 | AR | 1985 | 1.076 | 2327000 | 128.534592 | 26210736 | 37.000000 | 101.474998 | 37.000000 | 94.307622 | 10.468165 |
2 | 3 | AZ | 1985 | 1.076 | 3184000 | 104.522614 | 43956936 | 31.000000 | 108.578751 | 36.170418 | 100.909622 | 12.830456 |
3 | 4 | CA | 1985 | 1.076 | 26444000 | 100.363037 | 447102816 | 26.000000 | 107.837341 | 32.104000 | 100.220580 | 15.713321 |
4 | 5 | CO | 1985 | 1.076 | 3209000 | 112.963539 | 49466672 | 31.000000 | 94.266663 | 31.000000 | 87.608425 | 14.326190 |
Now we’ll specify the model. With pyfixest, an IV regression takes the form dep ~ exog | fixed effects | endog ~ instruments: the endogenous variable (endog) will be regressed on the instruments in the first stage, and dep will then be regressed on the exogenous variables and the predicted values of endog in the second stage.
In this case, the first stage regression is

$$
\widehat{\ln (\text{rprice}_{it})} = \pi_1 \text{taxs}_{it} + \pi_2 \ln (\text{rincome}_{it}) + \alpha_i + \delta_t + u_{it}
$$

and the second stage is

$$
\ln (\text{packs}_{it}) = \beta_1 \widehat{\ln (\text{rprice}_{it})} + \beta_2 \ln (\text{rincome}_{it}) + \alpha_i + \delta_t + \epsilon_{it}
$$

where \(\alpha_i\) and \(\delta_t\) are state and year fixed effects respectively.
results_iv = pf.feols("np.log(packs) ~ np.log(rincome) + 1 | year + state | np.log(rprice) ~ taxs ", data=dfiv, vcov={"CRV1": "year"})
results_iv.summary()
###
Estimation: IV
Dep. var.: np.log(packs), Fixed effects: year+state
Inference: CRV1
Observations: 96
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:----------------|-----------:|-------------:|----------------------:|-----------:|-------:|--------:|
| np.log(rprice) | -1.279 | 0.000 | -6058735433955890.000 | 0.000 | -1.279 | -1.279 |
| np.log(rincome) | 0.443 | 0.000 | 3295534227163963.000 | 0.000 | 0.443 | 0.443 |
---
We can compare the IV results against (naive) OLS. First, run the OLS equivalent:
results_cigs_ols = pf.feols("np.log(packs) ~ np.log(rprice) + np.log(rincome) | year + state", data=dfiv, vcov={"CRV1": "year"})
results_cigs_ols.summary()
###
Estimation: OLS
Dep. var.: np.log(packs), Fixed effects: year+state
Inference: CRV1
Observations: 96
| Coefficient | Estimate | Std. Error | t value | Pr(>|t|) | 2.5% | 97.5% |
|:----------------|-----------:|-------------:|----------------------:|-----------:|-------:|--------:|
| np.log(rprice) | -1.056 | 0.000 | -6960957983563793.000 | 0.000 | -1.056 | -1.056 |
| np.log(rincome) | 0.497 | 0.000 | 5834939776353078.000 | 0.000 | 0.497 | 0.497 |
---
RMSE: 0.044 R2: 0.967 R2 Within: 0.556
Let’s do a proper comparison of these models using etable()
. Note the order: the IV results are the second column of data.
pf.etable([results_cigs_ols, results_iv], type="md")
est1 est2
--------------- ----------------- -----------------
depvar np.log(packs) np.log(packs)
-----------------------------------------------------
np.log(rprice) -1.056*** (0.000) -1.279*** (0.000)
np.log(rincome) 0.497*** (0.000) 0.443*** (0.000)
-----------------------------------------------------
year x x
state x x
-----------------------------------------------------
R2 0.967 -
S.E. type by: year by: year
Observations 96 96
-----------------------------------------------------
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001
Format of coefficient cell:
Coefficient (Std. Error)
Once we take into account the fact that the real price is endogenous to (realised) demand, we find that its coefficient is more negative; i.e. an increase in the real price of cigarettes is associated with a bigger fall in the number of packs bought than naive OLS would suggest.