Regression
Contents
Regression#
Introduction#
In this chapter, you’ll learn how to run linear regressions with code.
If you’re running this code (either by copying and pasting it, or by downloading it using the icons at the top of the page), you may need to install the packages it uses first. There’s a brief guide to installing packages in the Chapter on Preliminaries.
Most of this chapter will rely on statsmodels with some use of linearmodels. Some of the material in this chapter follows Grant McDermott’s excellent notes and the Library of Statistical Translation.
Quick Review of Econometrics#
Notation#
Greek letters, like \(\beta\), are the truth and represent parameters. Modified Greek letters are an estimate of the truth, for example \(\hat{\beta}\). Sometimes Greek letters will stand in for vectors of parameters. Most of the time, upper case Latin characters such as \(X\) will represent random variables (which could have more than one dimension). Lower case letters from the Latin alphabet denote realised data, for instance \(x\) (which again could be multi-dimensional). Modified Latin alphabet letters denote computations performed on data, for instance \(\bar{x} = \frac{1}{n} \displaystyle\sum_{i} x_i\) where \(n\) is number of samples.
Ordinary least squares (OLS) regression can be used to estimate the parameters of certain types of model, most typically models of the form
This generic model says that the value of an outcome variable \(y\) is a linear function of one or more input predictor variables \(x_i\), where the \(x_i\) could be transforms of original data. But the above equation is a platonic ideal, what we call a data generating process (DGP). OLS allows us to recover estimates of the parameters of the model , i.e. to find \(\hat{\beta_i}\) and to enable us to write an estimated model:
This equation can also be expressed in matrix form as
where \(x' = (1, x_1, \dots, x_{n})'\) and \(\hat{\beta} = (\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_{n}})\).
Given data \(y_i\) stacked to make a vector \(y\) and \(x_{i}\) stacked to make a matrix \(X\), this can be solved for the coefficients \(\hat{\beta}\) according to
Gauss-Markov Conditions#
To be sure that the estimates of these parameters are the best linear unbiased estimate, a few conditions need to hold: the Gauss-Markov conditions:
\(y\) is a linear function of the \(\beta_i\)
\(y\) and the \(x_i\) are randomly sampled from the population.
There is no perfect multi-collinearity of variables.
\(\mathbb{E}(\epsilon | x_1, \dots, x_n) = 0\) (unconfoundedness)
\(\text{Var}(\epsilon | x_1, \dots, x_n) = \sigma^2\) (homoskedasticity)
(1)-(4) also guarantee that OLS estimates are unbiased and \(\mathbb{E}(\hat{\beta}_i) = \beta_i\).
The classic linear model requires a 6th assumption; that \(\epsilon \thicksim \mathcal{N}(0, \sigma^2)\).
The interpretation of regression coefficients depends on what their units are to begin with, but you can always work it out by differentiating both sides of the model equation with respect to the \(x_i\). For example, for the first model equation above
so we get the interpretation that \(\beta_i\) is the rate of change of y with respect to \(x_i\). If \(x_i\) and \(y\) are in levels, this means that a unit increase in \(x_i\) is associated with a \(\beta_i\) units increase in \(y\). If the right-hand side of the model is \(\ln x_i\) then we get
with some abuse of notation, we can rewrite this as \(\partial y = \beta_i \partial x_i/x_i\), which says that a percent change in \(x_i\) is associated with a \(\beta_i\) unit change in \(y\). With a logged \(y\) variable, it’s a percent change in \(x_i\) that is associated with a percent change in \(y\), or \(\partial y/y = \beta_i \partial x_i/x_i\) (note that both sides of this equation are unitless in this case). Finally, another example that is important in practice is that of log differences, eg \(y = \beta_i (\ln x_i - \ln x_i')\). Again, we will abuse notation and say that this case may be represented as \(\partial y = \beta_i (\partial x_i/x_i - \partial x_i'/x_i')\), i.e. the difference in two percentages, a percentage point change, in \(x_i\) is associated with a \(\beta_i\) unit change in \(y\).
Fixed vs Random Effects#
As this is going to come up in this chapter, it’s worth having a quick review of the difference between fixed and random effects in regression-based models. There’s a bit of disagreement about what people really mean when they say “fixed effects” or “random effects”, so let’s define them clearly here.
Random effects are estimated with partial pooling, while fixed effects are not.
What does this mean? Well, technically, if an effect is assumed to be a realised value of a random variable, it is called a random effect. We could say that, in frequentist land (where parameters have true values), a single fixed effect model might look like:
where \(d \in \{0, 1\}\), which could be whether an individual belongs to, say, group 0 or group 1 and \(\gamma\) is a coefficient that is to be estimated. \(\gamma\) has a single “true” value, and we can admire the value of \(\hat{\gamma}\) in a regression table once we have done our estimation.
In contrast, even in a frequentist model, random effects are allowed to vary: they are drawn from a distribution (or, more accurately, their deviations from the true value are drawn from a distribution). This might be written as
and it could be that, instead of a group-specific intercept as in the fixed effects model above, the intercept can take a range of values depending on the individual. Although it’s not always true, another way of thinking about this is that fixed effects are constant across individuals while random effects can vary.
So what’s this business about partial pooling? Imagine we have a situation where one group is under-sampled relative to the other groups. With few data points, we may not be able to precisely estimate fixed effects. However, partial pooling means that, if you do have few data points within a group, the group effect estimate can be partially based on the more abundant data from other groups. Random effects can provide a good compromise between estimating an effect by completely pooling all groups, which masks group-level variation, and estimating an effect for all groups completely separately, which could give imprecise estimates for low-sample groups. For example, when estimating means using a random effects approach, subgroup means are allowed to deviate a bit from the mean of the larger group, but not by an arbitrary amount—those deviations follow a distribution, typically a Gaussian (aka Normal) distribution. That’s where the “random” in random effects comes from: the deviations of subgroups from a “parent group” follow the distribution of a random variable.
The rule of thumb when thinking about how many different categories should exist for a variable to get the random, rather than fixed, effects treatment is that there should be more than 5 or 6 levels at a minimum in order to apply random effects—fewer than this and it’s hard to estimate the variance around the central estimate. Because you could potentially precisely estimated a fixed effect if you had lots of samples in those 5 or 6 levels, you’re more likely to reach for random effects when you have fewer samples per class.
Random effects are the extension of this partial pooling technique to a wide range of situations: including multiple predictors, mixed continuous and categorical variables, and more.
As a conrete example, taken from a forum post by Paul of MassMutual, suppose you want to estimate average US household income by ZIP code. You have a large dataset containing observations of household incomes and ZIP codes. Some ZIP codes are well represented in the dataset, but others have only a couple of households.
For your initial model you would most likely take the mean income in each ZIP. This will work well when you have lots of data for all ZIPs, but the estimates for your poorly sampled ZIPs will suffer from high variance. You can mitigate this by using a shrinkage estimator (aka partial pooling), which will push extreme values towards the mean income across all ZIP codes.
But how much shrinkage/pooling should you do for a particular ZIP? Intuitively, it should depend on the following:
How many observations you have in that ZIP
How many observations you have overall
The individual-level mean and variance of household income across all ZIP codes
The group-level variance in mean household income across all ZIP codes
Now, if you model ZIP code as a random effect, the mean income estimate in all ZIP codes will be subjected to a statistically well-founded shrinkage, taking into account all of these factors.
Random effects models automatically handle 4, the variability estimation, for all random effects in the model. This is useful, as it would be hard to do manually. Having accounted for (1)–(4), a random effects model is able to determine the appropriate shrinkage for low-sample groups. It can also handle much more complicated models with many different predictors.
Mixed effects models are models that combine random and fixed effects.
Imports#
Let’s import some of the packages we’ll be using:
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import os
from pathlib import Path
Regression basics#
There are two ways to run regressions in statsmodels; passing the data directly as objects, and using formulae. We’ll see both but, just to get things started, let’s use the formula API.
We’ll use the starwars dataset to run a regression of mass on height for star wars characters. This example borrows very heavily from notes by Grant McDermott. First, let’s bring the dataset in:
df = pd.read_csv(
"https://github.com/aeturrell/coding-for-economists/raw/main/data/starwars.csv",
index_col=0,
)
# Look at first few rows
df.head()
name | height | mass | hair_color | eye_color | gender | homeworld | species | |
---|---|---|---|---|---|---|---|---|
0 | Luke Skywalker | 172.0 | 77.0 | blond | blue | male | Tatooine | Human |
1 | C-3PO | 167.0 | 75.0 | NaN | yellow | NaN | Tatooine | Droid |
2 | R2-D2 | 96.0 | 32.0 | NaN | red | NaN | Naboo | Droid |
3 | Darth Vader | 202.0 | 136.0 | none | yellow | male | Tatooine | Human |
4 | Leia Organa | 150.0 | 49.0 | brown | brown | female | Alderaan | Human |
Okay, now let’s do a regression using OLS and a formula that says our y-variable is mass and our regressor is height:
results = smf.ols("mass ~ height", data=df).fit()
Well, where are the results!? They’re stored in the object we created. To peek at them we need to call the summary function (and, for easy reading, I’ll print it out too using print
)
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mass R-squared: 0.018
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 1.040
Date: Wed, 01 Feb 2023 Prob (F-statistic): 0.312
Time: 00:42:45 Log-Likelihood: -385.50
No. Observations: 59 AIC: 775.0
Df Residuals: 57 BIC: 779.2
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -13.8103 111.155 -0.124 0.902 -236.393 208.773
height 0.6386 0.626 1.020 0.312 -0.615 1.892
==============================================================================
Omnibus: 128.880 Durbin-Watson: 2.025
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7330.447
Skew: 7.340 Prob(JB): 0.00
Kurtosis: 55.596 Cond. No. 895.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
What we’re seeing here are really several tables glued together. To just grab the coefficients in a tidy format, use
results.summary().tables[1]
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -13.8103 | 111.155 | -0.124 | 0.902 | -236.393 | 208.773 |
height | 0.6386 | 0.626 | 1.020 | 0.312 | -0.615 | 1.892 |
You’ll have noticed that we got an intercept, even though we didn’t specify one in the formula. statsmodels adds in an intercept by default because, most of the time, you will want one. To turn it off, add a -1
at the end of the formula command, eg in this case you would call smf.ols('mass ~ height -1', data=df).fit()
.
The fit we got in the case with the intercept was pretty terrible; a low \(R^2\) and both of our confidence intervals are large and contain zero. What’s going on? If there’s one adage in regression that’s always worth paying attention to, it’s always plot your data. Let’s see what’s going on here:
fig, ax = plt.subplots()
sns.scatterplot(data=df, x="height", y="mass", s=200, ax=ax, legend=False, alpha=0.8)
ax.annotate(
"Jabba the Hutt",
df.iloc[df["mass"].idxmax()][["height", "mass"]],
xytext=(0, -50),
textcoords="offset points",
arrowprops=dict(
arrowstyle="fancy",
color="k",
connectionstyle="arc3,rad=0.3",
),
)
ax.set_ylim(0, None)
ax.set_title("Always plot the data", loc="left")
plt.show()
Oh dear, Jabba’s been on the paddy frogs again, and he’s a bit of different case. When we’re estimating statistical relationships, we have all kinds of choices and should be wary about arbitrary decisions of what to include or exclude in case we fool ourselves about the generality of the relationship we are capturing. Let’s say we knew that we weren’t interested in Hutts though, but only in other species: in that case, it’s fair enough to filter out Jabba and run the regression without this obvious outlier. We’ll exclude any entry that contains the string ‘Jabba’ in the name
column:
results_outlier_free = smf.ols(
"mass ~ height", data=df[~df["name"].str.contains("Jabba")]
).fit()
print(results_outlier_free.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mass R-squared: 0.580
Model: OLS Adj. R-squared: 0.572
Method: Least Squares F-statistic: 77.18
Date: Wed, 01 Feb 2023 Prob (F-statistic): 4.02e-12
Time: 00:42:46 Log-Likelihood: -252.48
No. Observations: 58 AIC: 509.0
Df Residuals: 56 BIC: 513.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -32.5408 12.561 -2.591 0.012 -57.703 -7.379
height 0.6214 0.071 8.785 0.000 0.480 0.763
==============================================================================
Omnibus: 9.951 Durbin-Watson: 1.657
Prob(Omnibus): 0.007 Jarque-Bera (JB): 10.072
Skew: 0.793 Prob(JB): 0.00650
Kurtosis: 4.285 Cond. No. 888.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This looks a lot more healthy. Not only is the model explaining a lot more of the data, but the coefficients are now significant.
Robust regression#
Filtering out data is one way to deal with outliers, but it’s not the only one; an alternative is to use a regression technique that is robust to such outliers. statsmodels has a variety of robust linear models that you can read more about here. To demonstrate the general idea, we will run the regression again but using a robust method.
results_robust = smf.rlm(
"mass ~ height", data=df, M=sm.robust.norms.TrimmedMean(0.5)
).fit()
print(results_robust.summary())
Robust linear Model Regression Results
==============================================================================
Dep. Variable: mass No. Observations: 59
Model: RLM Df Residuals: 57
Method: IRLS Df Model: 1
Norm: TrimmedMean
Scale Est.: mad
Cov Type: H1
Date: Wed, 01 Feb 2023
Time: 00:42:46
No. Iterations: 7
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -31.4968 2.180 -14.451 0.000 -35.769 -27.225
height 0.6273 0.012 51.102 0.000 0.603 0.651
==============================================================================
If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .
There are many different ‘M-estimators’ available; in this case the TrimmedMean estimator gives a very similar result to the regression with the point excluded. We can visualise this, and, well, the results are not really very different in this case. Note that abline_plot
just takes an intercept and coefficient from a fitted model and renders the line that they encode.
fig, ax = plt.subplots()
ax.scatter(df["height"], df["mass"])
sm.graphics.abline_plot(model_results=results_robust, ax=ax, alpha=0.5, label="Robust")
sm.graphics.abline_plot(
model_results=results, ax=ax, color="red", label="OLS", alpha=0.5, ls="--"
)
ax.legend()
ax.set_xlabel("Height")
ax.set_ylabel("Mass")
ax.set_ylim(0, None)
plt.show()
Standard errors#
You’ll have seen that there’s a column for the standard error of the estimates in the regression table and a message saying that the covariance type of these is ‘nonrobust’. Let’s say that, instead, we want to use Eicker-Huber-White robust standard errors, aka “HC2” standard errors. We can specify to use these up front standard errors up front in the fit method:
(smf.ols("mass ~ height", data=df).fit(cov_type="HC2").summary().tables[1])
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | -13.8103 | 23.456 | -0.589 | 0.556 | -59.782 | 32.162 |
height | 0.6386 | 0.088 | 7.263 | 0.000 | 0.466 | 0.811 |
Or, alternatively, we can go back to our existing results and recompute the results from those:
print(results.get_robustcov_results("HC2").summary())
OLS Regression Results
==============================================================================
Dep. Variable: mass R-squared: 0.018
Model: OLS Adj. R-squared: 0.001
Method: Least Squares F-statistic: 52.75
Date: Wed, 01 Feb 2023 Prob (F-statistic): 1.16e-09
Time: 00:42:46 Log-Likelihood: -385.50
No. Observations: 59 AIC: 775.0
Df Residuals: 57 BIC: 779.2
Df Model: 1
Covariance Type: HC2
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -13.8103 23.456 -0.589 0.558 -60.779 33.159
height 0.6386 0.088 7.263 0.000 0.463 0.815
==============================================================================
Omnibus: 128.880 Durbin-Watson: 2.025
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7330.447
Skew: 7.340 Prob(JB): 0.00
Kurtosis: 55.596 Cond. No. 895.
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC2)
There are several different types of standard errors available in statsmodels:
‘HC0’, ‘HC1’, ‘HC2’, and ‘HC3’
‘HAC’, for heteroskedasticity and autocorrelation consistent standard errors, for which you may want to also use some keyword arguments
‘hac-groupsum’, for Driscoll and Kraay heteroscedasticity and autocorrelation robust standard errors in panel data, again for which you may have to specify extra keyword arguments
‘hac-panel’, for heteroscedasticity and autocorrelation robust standard errors in panel data, again with keyword arguments; and
‘cluster’ for clustered standard errors.
You can find information on all of these here. For more on standard errors in python, this is a good link.
For now, let’s look more closely at those last ones: clustered standard errors.
Clustered standard errors#
Often, we know something about the structure of likely errors, namely that they occur in groups. In the below example we use one-way clusters to capture this effect in the errors.
Note that in the below example, we grab a subset of the data for which a set of variables we’re interested in are defined, otherwise the below example would execute with an error because of missing cluster-group values.
xf = df.dropna(subset=["homeworld", "mass", "height", "species"])
results_clus = smf.ols("mass ~ height", data=xf).fit(
cov_type="cluster", cov_kwds={"groups": xf["homeworld"]}
)
print(results_clus.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mass R-squared: 0.014
Model: OLS Adj. R-squared: -0.005
Method: Least Squares F-statistic: 39.44
Date: Wed, 01 Feb 2023 Prob (F-statistic): 2.63e-07
Time: 00:42:46 Log-Likelihood: -361.23
No. Observations: 55 AIC: 726.5
Df Residuals: 53 BIC: 730.5
Df Model: 1
Covariance Type: cluster
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -8.7955 29.150 -0.302 0.763 -65.929 48.338
height 0.6159 0.098 6.280 0.000 0.424 0.808
==============================================================================
Omnibus: 121.086 Durbin-Watson: 2.029
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5943.317
Skew: 7.095 Prob(JB): 0.00
Kurtosis: 51.909 Cond. No. 958.
==============================================================================
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
We can add two-way clustering of standard errors using the following:
xf = df.dropna(subset=["homeworld", "mass", "height", "species"])
two_way_clusters = np.array(xf[["homeworld", "species"]], dtype=str)
results_clus = smf.ols("mass ~ height", data=xf).fit(
cov_type="cluster", cov_kwds={"groups": two_way_clusters}
)
print(results_clus.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mass R-squared: 0.014
Model: OLS Adj. R-squared: -0.005
Method: Least Squares F-statistic: 35.10
Date: Wed, 01 Feb 2023 Prob (F-statistic): 1.96e-06
Time: 00:42:46 Log-Likelihood: -361.23
No. Observations: 55 AIC: 726.5
Df Residuals: 53 BIC: 730.5
Df Model: 1
Covariance Type: cluster
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -8.7955 30.636 -0.287 0.774 -68.841 51.250
height 0.6159 0.104 5.925 0.000 0.412 0.820
==============================================================================
Omnibus: 121.086 Durbin-Watson: 2.029
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5943.317
Skew: 7.095 Prob(JB): 0.00
Kurtosis: 51.909 Cond. No. 958.
==============================================================================
Notes:
[1] Standard Errors are robust to cluster correlation (cluster)
As you would generally expect, the addition of clustering has increased the standard errors.
Fixed effects and categorical variables#
Fixed effects are a way of allowing the intercept of a regression model to vary freely across individuals or groups. It is, for example, used to control for any individual-specific attributes that do not vary across time in panel data.
Let’s use the ‘mtcars’ dataset to demonstrate this. We’ll read it in and set the datatypes of some of the columns at the same time.
mpg = pd.read_csv(
"https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv",
dtype={"model": str, "mpg": float, "hp": float, "disp": float, "cyl": "category"},
)
mpg.head()
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110.0 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110.0 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93.0 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110.0 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175.0 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Now we have our data in we want to regress mpg (miles per gallon) on hp (horsepower) with fixed effects for cyl (cylinders). Now we could just pop in a formula like this 'mpg ~ hp + cyl'
because we took the trouble to declare that cyl
was of datatype category when reading it in from the csv file. This means that statsmodels will treat it as a category and use it as a fixed effect by default.
But when I read that formula I get nervous that cyl
might not have been processed correctly (ie it could have been read in as a float, which is what it looks like) and it might just be treated as a float (aka a continuous variable) in the regression. Which is not what we want at all. So, to be safe, and make our intentions explicit (even when the data is of type ‘category’), it’s best to use the syntax C(cyl)
to ask for a fixed effect.
Here’s a regression which does that:
results_fe = smf.ols("mpg ~ hp + C(cyl)", data=mpg).fit()
print(results_fe.summary())
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.754
Model: OLS Adj. R-squared: 0.727
Method: Least Squares F-statistic: 28.59
Date: Wed, 01 Feb 2023 Prob (F-statistic): 1.14e-08
Time: 00:42:46 Log-Likelihood: -79.948
No. Observations: 32 AIC: 167.9
Df Residuals: 28 BIC: 173.8
Df Model: 3
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
Intercept 28.6501 1.588 18.044 0.000 25.398 31.903
C(cyl)[T.6] -5.9677 1.639 -3.640 0.001 -9.326 -2.610
C(cyl)[T.8] -8.5209 2.326 -3.663 0.001 -13.286 -3.756
hp -0.0240 0.015 -1.560 0.130 -0.056 0.008
==============================================================================
Omnibus: 0.251 Durbin-Watson: 1.667
Prob(Omnibus): 0.882 Jarque-Bera (JB): 0.417
Skew: 0.163 Prob(JB): 0.812
Kurtosis: 2.545 Cond. No. 766.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We can see here that two of the three possible values of cyl
:
mpg["cyl"].unique()
['6', '4', '8']
Categories (3, object): ['4', '6', '8']
have been added as fixed effects regressors. The way that +C(cyl)
has been added makes it so that the coefficients given are relative to the coefficient for the intercept. We can turn the intercept off to get a coefficient per unique cyl
value:
print(smf.ols("mpg ~ hp + C(cyl) -1", data=mpg).fit().summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
C(cyl)[4] 28.6501 1.588 18.044 0.000 25.398 31.903
C(cyl)[6] 22.6825 2.228 10.180 0.000 18.119 27.246
C(cyl)[8] 20.1293 3.331 6.042 0.000 13.305 26.953
hp -0.0240 0.015 -1.560 0.130 -0.056 0.008
==============================================================================
When there is an intercept, the coefficients of fixed effect variables can be interpreted as being the average of \(y\) for that class compared to the excluded classes holding all other categories and variables fixed.
High dimensional fixed effects, aka absorbing regression#
Sometimes, you just have a LOT of fixed effects (and perhaps you don’t particularly care about them individually). A common example is having a large number of firms as part of a panel. Fortunately, there are ways to make regression with high dimensional fixed effects be both fast and concise. (In Stata, this is provided by the reghdfe
package.) Here, we will use the linearmodels package, which is built on top of statsmodels.
Let’s say we have a regression of the form
where \(y_i\) are observations indexed by \(i\), \(x_i\) are vectors of exogenous variables we care about the coefficients (\(\beta\)), \(z_i\) are vectors of fixed effects we don’t care too much about the coefficients (\gamma) for, and the \(\epsilon_i\) are errors. Then we can use an absorbing regression to solve for the \(\beta\) while ignoring the \(\gamma\).
Here’s an example using simulated data on workers taken from the linearmodels docs. Let’s simulate some data first, with two fixed effects (state and firm) alongside the two exogenous variables we’re interested in.
from numpy.random import default_rng
rng = default_rng() # Random number generator
# Create synthetic input data
nobs = 1_000_000 # No. observations
state_id = rng.integers(50, size=nobs) # State identifier
firm_id = rng.integers(nobs // 5, size=nobs) # Firm identifier (mean of 5 workers/firm)
x = rng.standard_normal((nobs, 2)) # Exogenous variables
sim = pd.DataFrame(
{
"state_id": pd.Categorical(state_id),
"firm_id": pd.Categorical(firm_id),
"exog_0": x[:, 0],
"exog_1": x[:, 1],
}
)
# Create synthetic relationship
beta = [1, 3] # coefficients of interest
state_effects = rng.standard_normal(state_id.max() + 1)
state_effects = state_effects[state_id] # Generate state fixed effects
firm_effects = rng.standard_normal(firm_id.max() + 1)
firm_effects = firm_effects[firm_id] # Generate firm fixed effects
eps = rng.standard_normal(nobs) # Generate errors
# Generate endogeneous outcome variable
sim["y"] = (
sim["exog_0"] * beta[0]
+ sim["exog_1"] * beta[1]
+ firm_effects
+ state_effects
+ eps
)
sim.head()
state_id | firm_id | exog_0 | exog_1 | y | |
---|---|---|---|---|---|
0 | 30 | 109981 | 0.231898 | 0.227373 | -0.330146 |
1 | 16 | 143756 | -0.826250 | -0.907279 | -6.369599 |
2 | 36 | 178960 | -1.927288 | -2.205871 | -9.504071 |
3 | 25 | 6654 | -0.291475 | 0.711902 | 0.364891 |
4 | 26 | 180110 | 0.653378 | -0.146633 | 2.239330 |
Now we pass this to linearmodels and with the state_id
and firm_id
variables entered via the absorb
keyword argument:
from linearmodels.iv.absorbing import AbsorbingLS
mod = AbsorbingLS(
sim["y"], sim[["exog_0", "exog_1"]], absorb=sim[["state_id", "firm_id"]]
)
print(mod.fit())
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/absorbing.py:710: FutureWarning: In a future version of pandas all arguments of DataFrame.any and Series.any will be keyword-only.
missing |= self._absorb_inter.cat.isnull().any(1).to_numpy()
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/absorbing.py:711: FutureWarning: In a future version of pandas all arguments of DataFrame.any and Series.any will be keyword-only.
missing |= self._absorb_inter.cont.isnull().any(1).to_numpy()
Absorbing LS Estimation Summary
==================================================================================
Dep. Variable: y R-squared: 0.9369
Estimator: Absorbing LS Adj. R-squared: 0.9213
No. Observations: 1000000 F-statistic: 9.808e+06
Date: Wed, Feb 01 2023 P-value (F-stat): 0.0000
Time: 00:42:51 Distribution: chi2(2)
Cov. Estimator: robust R-squared (No Effects): 0.9091
Varaibles Absorbed: 1.987e+05
Parameter Estimates
==============================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
------------------------------------------------------------------------------
exog_0 0.9995 0.0010 988.51 0.0000 0.9975 1.0014
exog_1 2.9988 0.0010 2972.2 0.0000 2.9969 3.0008
==============================================================================
So, from our 1,000,000 observations, we have roughly 200,000 fixed effects that have been scooped up and packed away, leaving us with just the coefficients, \(\beta\), on the exogenous variables of interest.
Transformations of regressors#
This chapter is showcasing linear regression. What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can’t be some kind of (potentially non-linear) transform of the original features \(x_i\).
Logs and arcsinh#
You have two options for adding in logs: do them before, or do them in the formula. Doing them before just makes use of standard dataframe operations to declare a new column:
mpg["lnhp"] = np.log(mpg["hp"])
print(smf.ols("mpg ~ lnhp", data=mpg).fit().summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 72.6405 6.004 12.098 0.000 60.378 84.903
lnhp -10.7642 1.224 -8.792 0.000 -13.265 -8.264
==============================================================================
Alternatively, you can specify the log directly in the formula:
results_ln = smf.ols("mpg ~ np.log(hp)", data=mpg).fit()
print(results_ln.summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 72.6405 6.004 12.098 0.000 60.378 84.903
np.log(hp) -10.7642 1.224 -8.792 0.000 -13.265 -8.264
==============================================================================
Clearly, the first method will work for arcsinh(x)
and log(x+1)
, but you can also pass both of these into the formula directly too. (For more on the pros and cons of arcsinh, see Bellemare and Wichman [2020].) Here it is with arcsinh:
print(smf.ols("mpg ~ np.arcsinh(hp)", data=mpg).fit().summary().tables[1])
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 80.1041 6.850 11.694 0.000 66.115 94.094
np.arcsinh(hp) -10.7646 1.224 -8.792 0.000 -13.265 -8.264
==================================================================================
Interaction terms and powers#
This chapter is showcasing linear regression. What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can’t be some kind of non-linear transform of the original features \(x_i\). Two of the most common transformations that you might want to use are interaction terms and polynomial terms. An example of an interaction term would be
while an example of a polynomial term would be
i.e. the last term enters only after it is multiplied by itself.
One note of warning: the interpretation of the effect of a variable is no longer as simple as was set out at the start of this chapter. To work out what the new interpretation is, the procedure is the same though: just take the derivative. In the case of the interaction model above, the effect of a unit change in \(x_1\) on \(y\) is now going to be a function of \(x_2\). In the case of the polynomial model above, the effect of a unit change in \(x_1\) on \(y\) will be \(2\beta_1 \cdot x_1\). For more on interaction terms, see Balli and Sørensen [2013].
Alright, with all of that preamble out of the way, let’s see how we actual do some of this! Let’s try including a linear and squared term in the regression of mpg
on hp
making use of the numpy power function:
res_poly = smf.ols("mpg ~ hp + np.power(hp, 2)", data=mpg).fit()
print(res_poly.summary().tables[1])
===================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------
Intercept 40.4091 2.741 14.744 0.000 34.804 46.015
hp -0.2133 0.035 -6.115 0.000 -0.285 -0.142
np.power(hp, 2) 0.0004 9.84e-05 4.275 0.000 0.000 0.001
===================================================================================
Now let’s include the original term in hp, a term in disp, and the interaction between them, which is represented by hp:disp in the table.
res_inter = smf.ols("mpg ~ hp * disp", data=mpg).fit()
print(res_inter.summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 39.6743 2.914 13.614 0.000 33.705 45.644
hp -0.0979 0.025 -3.956 0.000 -0.149 -0.047
disp -0.0734 0.014 -5.100 0.000 -0.103 -0.044
hp:disp 0.0003 8.69e-05 3.336 0.002 0.000 0.000
==============================================================================
In the unusual case that you want only the interaction term, you write it as it appears in the table above:
print(smf.ols("mpg ~ hp : disp", data=mpg).fit().summary().tables[1])
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 25.8099 1.005 25.677 0.000 23.757 27.863
hp:disp -0.0001 1.91e-05 -7.409 0.000 -0.000 -0.000
==============================================================================
The formula API explained#
As you will have seen ~
separates the left- and right-hand sides of the regression. +
computes a set union, which will also be familiar from the examples above (ie it inludes two terms as long as they are distinct). -
computes a set difference; it adds the set of terms to the left of it while removing any that appear on the right of it. As we’ve seen, a*b
is a short-hand for a + b + a:b
, with the last term representing the interaction. /
is short hand for a + a:b
, which is useful if, for example b
is nested within a
, so it doesn’t make sense to control for b
on its own. Actually, the :
character can interact multiple terms so that (a + b):(d + c)
is the same as a:c + a:d + b:c + b:d
. C(a)
tells statsmodels to treat a
as a categorical variable that will be included as a fixed effect. Finally, as we saw above with powers, you can also pass in vectorised functions, such as np.log
and np.power
, directly into the formulae.
One gotcha with the formula API is ensuring that you have sensible variable names in your dataframe, i.e. ones that do not include whitespace or, to take a really pathological example, have the name ‘a + b’ for one of the columns that you want to regress on. You can dodge this kind of problem by passing in the variable name as, for example, Q("a + b")
to be clear that the column name is anything within the Q("...")
.
Multiple regression models#
As is so often the case, you’re likely to want to run more than one model at once with different specifications. Although there is a base version of this in statsmodels, called summary_col
, which you can find an example of here, instead we’ll be using the stargazer package to assemble the regressions together in a table.
In the above examples, we’ve collected a few different regression results. Let’s put them together:
from stargazer.stargazer import Stargazer
stargazer_tab = Stargazer([results_ln, res_poly, res_inter])
stargazer_tab
Dependent variable:mpg | |||
(1) | (2) | (3) | |
Intercept | 72.640^{***} | 40.409^{***} | 39.674^{***} |
(6.004) | (2.741) | (2.914) | |
disp | -0.073^{***} | ||
(0.014) | |||
hp | -0.213^{***} | -0.098^{***} | |
(0.035) | (0.025) | ||
hp:disp | 0.000^{***} | ||
(0.000) | |||
np.log(hp) | -10.764^{***} | ||
(1.224) | |||
np.power(hp, 2) | 0.000^{***} | ||
(0.000) | |||
Observations | 32 | 32 | 32 |
R^{2} | 0.720 | 0.756 | 0.820 |
Adjusted R^{2} | 0.711 | 0.739 | 0.801 |
Residual Std. Error | 3.239 (df=30) | 3.077 (df=29) | 2.692 (df=28) |
F Statistic | 77.301^{***} (df=1; 30) | 44.953^{***} (df=2; 29) | 42.475^{***} (df=3; 28) |
Note: | ^{*}p<0.1; ^{**}p<0.05; ^{***}p<0.01 |
There are lots of customisation options, including ones that add a title, rename variables, add notes, and so on. What is most useful is that as well as the HTML friendly output that you can see above, the package also exports to latex:
print(stargazer_tab.render_latex())
\begin{table}[!htbp] \centering
\begin{tabular}{@{\extracolsep{5pt}}lccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{3}{c}{\textit{Dependent variable:}} \
\cr \cline{3-4}
\\[-1.8ex] & (1) & (2) & (3) \\
\hline \\[-1.8ex]
Intercept & 72.640$^{***}$ & 40.409$^{***}$ & 39.674$^{***}$ \\
& (6.004) & (2.741) & (2.914) \\
disp & & & -0.073$^{***}$ \\
& & & (0.014) \\
hp & & -0.213$^{***}$ & -0.098$^{***}$ \\
& & (0.035) & (0.025) \\
hp:disp & & & 0.000$^{***}$ \\
& & & (0.000) \\
np.log(hp) & -10.764$^{***}$ & & \\
& (1.224) & & \\
np.power(hp, 2) & & 0.000$^{***}$ & \\
& & (0.000) & \\
\hline \\[-1.8ex]
Observations & 32 & 32 & 32 \\
$R^2$ & 0.720 & 0.756 & 0.820 \\
Adjusted $R^2$ & 0.711 & 0.739 & 0.801 \\
Residual Std. Error & 3.239(df = 30) & 3.077(df = 29) & 2.692(df = 28) \\
F Statistic & 77.301$^{***}$ (df = 1.0; 30.0) & 44.953$^{***}$ (df = 2.0; 29.0) & 42.475$^{***}$ (df = 3.0; 28.0) \\
\hline
\hline \\[-1.8ex]
\textit{Note:} & \multicolumn{3}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\end{tabular}
\end{table}
And of course this can be written to a file using open('regression.tex', 'w').write(stargazer.render_latex())
where you can get your main latex compilation to scoop it up and use it.
Specifying regressions without formulae, using the array API#
As noted, there are two ways to run regressions in statsmodels; passing the data directly as objects, and using formulae. We’ve seen the formula API, now let’s see how to specify regressions using arrays with the format sm.OLS(y, X)
.
We will first need to take the data out of the pandas dataframe and put it into a couple of arrays. When we’re not using the formula API, the default is to treat the array X as the design matrix for the regression-so, if it doesn’t have a column of constants in, there will be no intercept in the regression. Therefore, we need to add a constant vector to the matrix X
if we do want an intercept. Use sm.add_constant(X)
for this.
X = np.array(xf["height"])
y = np.array(xf["mass"])
X = sm.add_constant(X)
results = sm.OLS(y, X).fit()
print(results.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.014
Model: OLS Adj. R-squared: -0.005
Method: Least Squares F-statistic: 0.7446
Date: Wed, 01 Feb 2023 Prob (F-statistic): 0.392
Time: 00:42:51 Log-Likelihood: -361.23
No. Observations: 55 AIC: 726.5
Df Residuals: 53 BIC: 730.5
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const -8.7955 127.191 -0.069 0.945 -263.909 246.318
x1 0.6159 0.714 0.863 0.392 -0.816 2.048
==============================================================================
Omnibus: 121.086 Durbin-Watson: 2.029
Prob(Omnibus): 0.000 Jarque-Bera (JB): 5943.317
Skew: 7.095 Prob(JB): 0.00
Kurtosis: 51.909 Cond. No. 958.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
This approach seems a lot less convenient, not to mention less clear, so you may be wondering when it is useful. It’s useful when you want to do many regressions in a systematic way or when you don’t know what the columns of a dataset will be called ahead of time. It can actually be a little bit simpler to specify for more complex regressions too.
Fixed effects in the array API#
If you’re using the formula API, it’s easy to turn a regressor x
into a fixed effect by putting C(x)
into the model formula, as you’ll see in the next section.
For the array API, things are not that simple and you need to use dummy variables. Let’s say we have some data like this:
# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng) # prng=probabilistic random number generator
no_obs = 200
X = pd.DataFrame(prng.normal(size=no_obs))
X[1] = prng.choice(["a", "b"], size=no_obs)
# Get this a numpy array
X = X.values
# Create the y data, adding in a bit of noise
y = X[:, 0] * 2 + 0.5 + prng.normal(scale=0.1, size=no_obs)
y = [el_y + 1.5 if el_x == "a" else el_y + 3.4 for el_y, el_x in zip(y, X[:, 1])]
X[:5, :]
array([[0.927388713624307, 'b'],
[-0.05975401040013389, 'a'],
[1.0437358922863453, 'a'],
[1.5056188793532617, 'b'],
[0.10762743224577262, 'a']], dtype=object)
The first feature (column) is of numbers and it’s clear how we include it. The second, however, is a grouping that we’d like to include as a fixed effect. But if we just throw this matrix into sm.OLS(y, X)
, we’re going to get trouble because statsmodels isn’t sure what to do with a vector of strings. So, instead, we need to create some dummy variables out of our second column of data
Astonishingly, there are several popular ways to create dummy variables in Python: scikit-learn’s OneHotEncoder
and pandas’ get_dummies
being my favourites. Let’s use the latter here.
pd.get_dummies(X[:, 1])
a | b | |
---|---|---|
0 | 0 | 1 |
1 | 1 | 0 |
2 | 1 | 0 |
... | ... | ... |
197 | 1 | 0 |
198 | 1 | 0 |
199 | 0 | 1 |
200 rows × 2 columns
We just need to pop this into our matrix \(X\):
X = np.column_stack([X[:, 0], pd.get_dummies(X[:, 1])])
X = np.array(X, dtype=float)
X[:5, :]
array([[ 0.92738871, 0. , 1. ],
[-0.05975401, 1. , 0. ],
[ 1.04373589, 1. , 0. ],
[ 1.50561888, 0. , 1. ],
[ 0.10762743, 1. , 0. ]])
Okay, so now we’re ready to do our regression:
print(sm.OLS(y, X).fit().summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.998
Model: OLS Adj. R-squared: 0.998
Method: Least Squares F-statistic: 4.302e+04
Date: Wed, 01 Feb 2023 Prob (F-statistic): 6.90e-261
Time: 00:42:51 Log-Likelihood: 165.45
No. Observations: 200 AIC: -324.9
Df Residuals: 197 BIC: -315.0
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
x1 1.9893 0.008 263.045 0.000 1.974 2.004
x2 1.9924 0.011 176.203 0.000 1.970 2.015
x3 3.8977 0.010 384.415 0.000 3.878 3.918
==============================================================================
Omnibus: 2.096 Durbin-Watson: 2.116
Prob(Omnibus): 0.351 Jarque-Bera (JB): 2.017
Skew: 0.245 Prob(JB): 0.365
Kurtosis: 2.955 Cond. No. 1.50
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Perhaps you can see why I generally prefer the formula API…
Instrumental variables#
Rather than use statsmodels for IV, we’ll use the linearmodels package, which has very clean documentation (indeed, this sub-section is indebted to that documentation).
Recall that a good instrumental variable \(z\) has zero covariance with the error from the regression (which is untestable) and non-zero covariance with the variable of interest (which is).
Recall that in IV regression, we have a model of the form
where \(x_{1i}\) is a set of \(k_1\) exogenous regressors and \(x_{2i}\) is a set of \(k_2\) endogenous regressors such that \(\text{Cov}(x_{2i}, \epsilon_i)\neq 0\). This is a problem for the usual OLS assumptions (the right-hand side should be exogenous).
To get around this, in 2-stage least squares IV, we first regress \(x_{2i}\) on instruments that explain \(x_{2i}\) but not \(y_i\), and then regress \(y_i\) only on the predicted/estimated left-hand side from the first regression, ie on \(\hat{x_{2i}}\). There are other estimators than IV2SLS, but I think that one has the most intuitive explanation of what’s going.
As well as a 2-stage least squares estimator called IV2SLS
, linearmodels has a Limited Information Maximum Likelihood (LIML) estimator IVLIML
, a Generalized Method of Moments (GMM) estimator IVGMM
, and a Generalized Method of Moments using the Continuously Updating Estimator (CUE) IVGMMCUE
.
Just as with OLS via statsmodels, there’s an option to use an array API for the linearmodels IV methods.
It’s always easiest to see an example, so let’s estimate what might cause (realised) cigarette demand for the 48 continental US states in 1995 with IV2SLS
. First we need to import the estimator, IV2SLS
, and the data:
from linearmodels.iv import IV2SLS
df = pd.read_csv(
"https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv",
dtype={"state": "category", "year": "category"},
).assign(
rprice=lambda x: x["price"] / x["cpi"],
rincome=lambda x: x["income"] / x["population"] / x["cpi"],
)
df.head()
Unnamed: 0 | state | year | cpi | population | packs | income | tax | price | taxs | rprice | rincome | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | AL | 1985 | 1.076 | 3973000.0 | 116.486282 | 46014968 | 32.500004 | 102.181671 | 33.348335 | 94.964381 | 10.763866 |
1 | 2 | AR | 1985 | 1.076 | 2327000.0 | 128.534592 | 26210736 | 37.000000 | 101.474998 | 37.000000 | 94.307622 | 10.468165 |
2 | 3 | AZ | 1985 | 1.076 | 3184000.0 | 104.522614 | 43956936 | 31.000000 | 108.578751 | 36.170418 | 100.909622 | 12.830456 |
3 | 4 | CA | 1985 | 1.076 | 26444000.0 | 100.363037 | 447102816 | 26.000000 | 107.837341 | 32.104000 | 100.220580 | 15.713321 |
4 | 5 | CO | 1985 | 1.076 | 3209000.0 | 112.963539 | 49466672 | 31.000000 | 94.266663 | 31.000000 | 87.608425 | 14.326190 |
Now we’ll specify the model. It’s going to be in the form dep ~ exog + [endog ~ instruments]
, where endog will be regressed on instruments and dep will be regressed on both exog and the predicted values of endog.
In this case, the model will be
in the first stage regression and
in the second stage.
results_iv2sls = IV2SLS.from_formula(
"np.log(packs) ~ 1 + np.log(rincome) + C(year) + C(state) + [np.log(rprice) ~ taxs]",
df,
).fit(cov_type="clustered", clusters=df["year"])
print(results_iv2sls.summary)
IV-2SLS Estimation Summary
==============================================================================
Dep. Variable: np.log(packs) R-squared: 0.9659
Estimator: IV-2SLS Adj. R-squared: 0.9279
No. Observations: 96 F-statistic: -1.349e+16
Date: Wed, Feb 01 2023 P-value (F-stat) 1.0000
Time: 00:42:51 Distribution: chi2(50)
Cov. Estimator: clustered
Parameter Estimates
===================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
-----------------------------------------------------------------------------------
Intercept 9.4924 0.0263 360.24 0.0000 9.4407 9.5440
C(state)[T.AR] 0.1770 0.0531 3.3338 0.0009 0.0729 0.2810
C(state)[T.AZ] -0.0899 0.0132 -6.8132 0.0000 -0.1158 -0.0640
C(state)[T.CA] -0.2781 0.0214 -12.996 0.0000 -0.3200 -0.2361
C(state)[T.CO] -0.2479 0.0090 -27.625 0.0000 -0.2655 -0.2303
C(state)[T.CT] -0.0171 0.0196 -0.8720 0.3832 -0.0556 0.0213
C(state)[T.DE] 0.1110 0.0291 3.8105 0.0001 0.0539 0.1682
C(state)[T.FL] 0.0762 0.0142 5.3596 0.0000 0.0483 0.1041
C(state)[T.GA] -0.0695 0.0251 -2.7706 0.0056 -0.1186 -0.0203
C(state)[T.IA] 0.0120 0.0739 0.1629 0.8706 -0.1328 0.1569
C(state)[T.ID] -0.1272 0.0077 -16.597 0.0000 -0.1423 -0.1122
C(state)[T.IL] -0.0339 0.0081 -4.1912 0.0000 -0.0497 -0.0180
C(state)[T.IN] 0.1198 0.0611 1.9609 0.0499 5.573e-05 0.2395
C(state)[T.KS] -0.0910 0.0305 -2.9884 0.0028 -0.1507 -0.0313
C(state)[T.KY] 0.3525 0.0631 5.5906 0.0000 0.2289 0.4761
C(state)[T.LA] 0.1315 0.0104 12.664 0.0000 0.1112 0.1519
C(state)[T.MA] -0.0403 0.0069 -5.8826 0.0000 -0.0538 -0.0269
C(state)[T.MD] -0.2322 0.0239 -9.7376 0.0000 -0.2790 -0.1855
C(state)[T.ME] 0.2008 0.0574 3.5011 0.0005 0.0884 0.3133
C(state)[T.MI] 0.1268 0.0745 1.7009 0.0890 -0.0193 0.2728
C(state)[T.MN] 0.0568 0.0490 1.1595 0.2463 -0.0392 0.1529
C(state)[T.MO] 0.0640 0.0476 1.3454 0.1785 -0.0292 0.1572
C(state)[T.MS] 0.1501 0.0272 5.5267 0.0000 0.0969 0.2034
C(state)[T.MT] -0.1522 0.0054 -28.250 0.0000 -0.1627 -0.1416
C(state)[T.NC] 0.0396 0.0191 2.0655 0.0389 0.0020 0.0771
C(state)[T.ND] -0.0311 0.0399 -0.7787 0.4361 -0.1092 0.0471
C(state)[T.NE] -0.0741 0.0375 -1.9765 0.0481 -0.1476 -0.0006
C(state)[T.NH] 0.3504 0.0315 11.114 0.0000 0.2886 0.4122
C(state)[T.NJ] -0.0873 6.107e-05 -1429.3 0.0000 -0.0874 -0.0872
C(state)[T.NM] -0.2858 0.0040 -71.049 0.0000 -0.2937 -0.2779
C(state)[T.NV] 0.1789 0.0259 6.9075 0.0000 0.1281 0.2296
C(state)[T.NY] -0.0719 0.0032 -22.256 0.0000 -0.0782 -0.0655
C(state)[T.OH] 0.0325 0.0402 0.8088 0.4186 -0.0463 0.1114
C(state)[T.OK] 0.0946 0.0538 1.7572 0.0789 -0.0109 0.2000
C(state)[T.OR] -0.0153 0.0673 -0.2269 0.8205 -0.1471 0.1166
C(state)[T.PA] -0.0031 0.0006 -4.8401 0.0000 -0.0044 -0.0019
C(state)[T.RI] 0.1394 0.0921 1.5136 0.1301 -0.0411 0.3200
C(state)[T.SC] -0.0212 0.0334 -0.6345 0.5257 -0.0866 0.0442
C(state)[T.SD] -0.0675 0.0711 -0.9488 0.3427 -0.2069 0.0719
C(state)[T.TN] 0.1473 0.0470 3.1340 0.0017 0.0552 0.2394
C(state)[T.TX] -0.0579 0.0136 -4.2560 0.0000 -0.0845 -0.0312
C(state)[T.UT] -0.4899 0.0276 -17.776 0.0000 -0.5440 -0.4359
C(state)[T.VA] -0.0559 0.0471 -1.1875 0.2350 -0.1482 0.0364
C(state)[T.VT] 0.2209 0.0467 4.7267 0.0000 0.1293 0.3125
C(state)[T.WA] 0.0064 0.0011 6.0151 0.0000 0.0043 0.0085
C(state)[T.WI] 0.0741 0.0590 1.2569 0.2088 -0.0415 0.1897
C(state)[T.WV] 0.1576 0.0582 2.7097 0.0067 0.0436 0.2716
C(state)[T.WY] -0.0169 0.0590 -0.2858 0.7750 -0.1325 0.0988
C(year)[T.1995] -0.0328 9.853e-09 -3.33e+06 0.0000 -0.0328 -0.0328
np.log(rincome) 0.4434
np.log(rprice) -1.2793 1.73e-09 -7.397e+08 0.0000 -1.2793 -1.2793
===================================================================================
Endogenous: np.log(rprice)
Instruments: taxs
Clustered Covariance (One-Way)
Debiased: False
Num Clusters: 2
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/results.py:180: RuntimeWarning: invalid value encountered in sqrt
std_errors = sqrt(diag(self.cov))
We sort of skipped a step here and did everything all in one go. If we did want to know how our first stage regression went, we can just pass a formula to IV2SLS
without the part in square brackets, [...]
, and it will run regular OLS.
But, in this case, there’s an easier way: we can print out a set of handy 1st stage statistics from running the full model.
print(results_iv2sls.first_stage)
First Stage Estimation Results
=========================================
np.log(rprice)
-----------------------------------------
R-squared 0.9760
Partial R-squared 0.7070
Shea's R-squared 0.7070
Partial F-statistic 6.353e+16
P-value (Partial F-stat) 0.0000
Partial F-stat Distn chi2(1)
========================== ==============
Intercept 4.4375
(410.34)
C(state)[T.AR] -0.0190
(-3.5717)
C(state)[T.AZ] 0.0552
(20.534)
C(state)[T.CA] 0.0969
(5.3807)
C(state)[T.CO] -0.0044
(-0.1117)
C(state)[T.CT] 0.1269
(9.0641)
C(state)[T.DE] 0.0254
(7.9851)
C(state)[T.FL] 0.0598
(4.0067)
C(state)[T.GA] -0.0049
(-0.3455)
C(state)[T.IA] 0.0135
(0.5305)
C(state)[T.ID] 0.0297
(79.832)
C(state)[T.IL] 0.0476
(12.008)
C(state)[T.IN] -0.0460
(-76.127)
C(state)[T.KS] -0.0019
(-0.0855)
C(state)[T.KY] -0.0783
(-3.6663)
C(state)[T.LA] 0.0298
(2.6211)
C(state)[T.MA] 0.0816
(5.3671)
C(state)[T.MD] -0.0199
(-0.5429)
C(state)[T.ME] 0.0370
(1.8744)
C(state)[T.MI] 0.0363
(1.9899)
C(state)[T.MN] 0.0932
(8.6047)
C(state)[T.MO] -0.0056
(-12.033)
C(state)[T.MS] 0.0147
(2.1430)
C(state)[T.MT] -0.0183
(-9.6530)
C(state)[T.NC] -0.0672
(-1.8041)
C(state)[T.ND] 0.0130
(1.8765)
C(state)[T.NE] 0.0105
(2.4846)
C(state)[T.NH] -0.0175
(-0.6334)
C(state)[T.NJ] 0.0694
(5.2527)
C(state)[T.NM] 0.0301
(2.6464)
C(state)[T.NV] 0.1075
(16.434)
C(state)[T.NY] 0.0847
(5.1334)
C(state)[T.OH] -0.0197
(-22.592)
C(state)[T.OK] -0.0035
(-0.7109)
C(state)[T.OR] 0.0146
(0.2944)
C(state)[T.PA] 0.0152
(0.9986)
C(state)[T.RI] 0.0249
(0.5964)
C(state)[T.SC] -0.0530
(-2.1043)
C(state)[T.SD] -0.0190
(-1.4001)
C(state)[T.TN] 0.0051
(0.5517)
C(state)[T.TX] 0.0378
(4.0760)
C(state)[T.UT] 0.0577
(5.8788)
C(state)[T.VA] 0.0265
(0.5400)
C(state)[T.VT] 0.0228
(1.0969)
C(state)[T.WA] 0.1554
(13.780)
C(state)[T.WI] 0.0723
(5.1954)
C(state)[T.WV] 0.0205
(1.7212)
C(state)[T.WY] -0.0005
(-0.0220)
C(year)[T.1995] 0.0822
( )
np.log(rincome) -0.0296
(-8.462e+06)
taxs 0.0051
(2.521e+08)
-----------------------------------------
T-stats reported in parentheses
T-stats use same covariance type as original model
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/results.py:180: RuntimeWarning: invalid value encountered in sqrt
std_errors = sqrt(diag(self.cov))
There are more tests and checks available. For example, Wooldridge’s regression test of exogeneity uses regression residuals from the endogenous variables regressed on the exogenous variables and the instrument to test for endogenity and is available to run on fitted model results. Let’s check that:
results_iv2sls.wooldridge_regression
Wooldridge's regression test of exogeneity
H0: Endogenous variables are exogenous
Statistic: 3332097971792274.0000
P-value: 0.0000
Distributed: chi2(1)
WaldTestStatistic, id: 0x7f9ffd62ae20
We can compare the IV results against (naive) OLS. First, run the OLS equivalent:
res_cig_ols = IV2SLS.from_formula(
"np.log(packs) ~ 1 + np.log(rincome) + C(year) + C(state) + np.log(rprice)", df
).fit(cov_type="clustered", clusters=df["year"])
Now select these two models to compare:
from collections import OrderedDict
from linearmodels.iv.results import compare
res = OrderedDict()
res["OLS"] = res_cig_ols
res["2SLS"] = results_iv2sls
print(compare(res))
Model Comparison
========================================================
OLS 2SLS
--------------------------------------------------------
Dep. Variable np.log(packs) np.log(packs)
Estimator OLS IV-2SLS
No. Observations 96 96
Cov. Est. clustered clustered
R-squared 0.9675 0.9659
Adj. R-squared 0.9313 0.9279
F-statistic -1.209e+17 -1.349e+16
P-value (F-stat) 1.0000 1.0000
================== =============== ===============
Intercept 8.3597 9.4924
(484.80) (360.24)
C(state)[T.AR] 0.1686 0.1770
(3.7460) (3.3338)
C(state)[T.AZ] -0.1280 -0.0899
(-49.908) (-6.8132)
C(state)[T.CA] -0.3320 -0.2781
(-9.0787) (-12.996)
C(state)[T.CO] -0.2591 -0.2479
(-345.21) (-27.625)
C(state)[T.CT] -0.1042 -0.0171
(-3.8346) (-0.8720)
C(state)[T.DE] 0.0900 0.1110
(3.3042) (3.8105)
C(state)[T.FL] 0.0325 0.0762
(1.9171) (5.3596)
C(state)[T.GA] -0.0691 -0.0695
(-2.4532) (-2.7706)
C(state)[T.IA] -0.0143 0.0120
(-0.2380) (0.1629)
C(state)[T.ID] -0.1422 -0.1272
(-8.4099) (-16.597)
C(state)[T.IL] -0.0766 -0.0339
(-10.214) (-4.1912)
C(state)[T.IN] 0.1231 0.1198
(2.0976) (1.9609)
C(state)[T.KS] -0.1073 -0.0910
(-4.8883) (-2.9884)
C(state)[T.KY] 0.3803 0.3525
(6.6476) (5.5906)
C(state)[T.LA] 0.1174 0.1315
(12.648) (12.664)
C(state)[T.MA] -0.1055 -0.0403
(-10.125) (-5.8826)
C(state)[T.MD] -0.2572 -0.2322
(-62.161) (-9.7376)
C(state)[T.ME] 0.1682 0.2008
(3.7655) (3.5011)
C(state)[T.MI] 0.0652 0.1268
(1.4699) (1.7009)
C(state)[T.MN] -0.0051 0.0568
(-0.1580) (1.1595)
C(state)[T.MO] 0.0600 0.0640
(1.2722) (1.3454)
C(state)[T.MS] 0.1472 0.1501
(6.1510) (5.5267)
C(state)[T.MT] -0.1469 -0.1522
(-26.043) (-28.250)
C(state)[T.NC] 0.0609 0.0396
(7.1838) (2.0655)
C(state)[T.ND] -0.0595 -0.0311
(-1.9467) (-0.7787)
C(state)[T.NE] -0.1002 -0.0741
(-3.4255) (-1.9765)
C(state)[T.NH] 0.3374 0.3504
(14.331) (11.114)
C(state)[T.NJ] -0.1458 -0.0873
(-11.223) (-1429.3)
C(state)[T.NM] -0.2981 -0.2858
(-28.465) (-71.049)
C(state)[T.NV] 0.1219 0.1789
(3.2128) (6.9075)
C(state)[T.NY] -0.1369 -0.0719
(-5.8164) (-22.256)
C(state)[T.OH] 0.0196 0.0325
(0.5379) (0.8088)
C(state)[T.OK] 0.0841 0.0946
(1.6659) (1.7572)
C(state)[T.OR] -0.0380 -0.0153
(-0.7768) (-0.2269)
C(state)[T.PA] -0.0333 -0.0031
(-16.905) (-4.8401)
C(state)[T.RI] 0.0900 0.1394
(1.3979) (1.5136)
C(state)[T.SC] -0.0037 -0.0212
(-0.1369) (-0.6345)
C(state)[T.SD] -0.0695 -0.0675
(-1.1052) (-0.9488)
C(state)[T.TN] 0.1373 0.1473
(3.2924) (3.1340)
C(state)[T.TX] -0.0964 -0.0579
(-3.8212) (-4.2560)
C(state)[T.UT] -0.5127 -0.4899
(-16.600) (-17.776)
C(state)[T.VA] -0.0628 -0.0559
(-1.7642) (-1.1875)
C(state)[T.VT] 0.2053 0.2209
(5.3662) (4.7267)
C(state)[T.WA] -0.0777 0.0064
(-6.5481) (6.0151)
C(state)[T.WI] 0.0260 0.0741
(0.5209) (1.2569)
C(state)[T.WV] 0.1490 0.1576
(2.4854) (2.7097)
C(state)[T.WY] -0.0150 -0.0169
(-0.2771) (-0.2858)
C(year)[T.1995] -0.0885 -0.0328
(-3.33e+06)
np.log(rincome) 0.4974 0.4434
(1.115e+07)
np.log(rprice) -1.0560 -1.2793
(-7.397e+08)
==================== ================= =================
Instruments taxs
--------------------------------------------------------
T-stats reported in parentheses
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/results.py:180: RuntimeWarning: invalid value encountered in sqrt
std_errors = sqrt(diag(self.cov))
Once we take into account the fact that the real price is endogeneous to (realised) demand, we find that its coefficient is more negative; i.e. an increase in the real price of cigarettes creates a bigger fall in number of packs bought.
Logit, probit, and generalised linear models#
Logit#
A logistical regression, aka a logit, is a statistical method for a best-fit line between a regressors \(X\) and an outcome varibale \(y\) that takes on values in \((0, 1)\).
The function that we’re assuming links the regressors and the outcome has a few different names but the most common is the sigmoid function or the logistic function. The data generating process is assumed to be
we can also write this as \(\ln\left(\frac{p}{p-1}\right) = \beta_0 + \sum_i \beta_i x_i\) to get a ‘log-odds’ relationship. The coefficients from a logit model do not have the same interpration as in an OLS estimation, and you can see this from the fact that \(\partial y/\partial x_i \neq \beta_i\) for logit. Of course, you can work out what the partial derivative is for yourself but most packages offer a convenient way to quickly recover the marginal effects.
Logit models are available in scikit-learn and statsmodels but bear in mind that the scikit-learn logit model is, ermm, extremely courageous in that regularisation is applied by default. If you don’t know what that means, don’t worry, but it’s probably best to stick with statsmodels as we will do in this example.
We will predict a target GRADE
, representing whether a grade improved or not, based on some regressors including participation in a programme.
# Load the data from Spector and Mazzeo (1980)
df = sm.datasets.spector.load_pandas().data
# Look at info on data
print(sm.datasets.spector.NOTE)
::
Number of Observations - 32
Number of Variables - 4
Variable name definitions::
Grade - binary variable indicating whether or not a student's grade
improved. 1 indicates an improvement.
TUCE - Test score on economics test
PSI - participation in program
GPA - Student's grade point average
res_logit = smf.logit("GRADE ~ GPA + TUCE + PSI", data=df).fit()
print(res_logit.summary())
Optimization terminated successfully.
Current function value: 0.402801
Iterations 7
Logit Regression Results
==============================================================================
Dep. Variable: GRADE No. Observations: 32
Model: Logit Df Residuals: 28
Method: MLE Df Model: 3
Date: Wed, 01 Feb 2023 Pseudo R-squ.: 0.3740
Time: 00:42:51 Log-Likelihood: -12.890
converged: True LL-Null: -20.592
Covariance Type: nonrobust LLR p-value: 0.001502
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -13.0213 4.931 -2.641 0.008 -22.687 -3.356
GPA 2.8261 1.263 2.238 0.025 0.351 5.301
TUCE 0.0952 0.142 0.672 0.501 -0.182 0.373
PSI 2.3787 1.065 2.234 0.025 0.292 4.465
==============================================================================
So, did participation (PSI
) help increase a grade? Yes. But we need to check the marginal effect to say exactly how much. We’ll use get_margeff
to do this, we’d like the \(dy/dx\) effect, and we’ll take it at the mean of each regressor.
marg_effect = res_logit.get_margeff(at="mean", method="dydx")
marg_effect.summary()
Dep. Variable: | GRADE |
---|---|
Method: | dydx |
At: | mean |
dy/dx | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
GPA | 0.5339 | 0.237 | 2.252 | 0.024 | 0.069 | 0.998 |
TUCE | 0.0180 | 0.026 | 0.685 | 0.493 | -0.033 | 0.069 |
PSI | 0.4493 | 0.197 | 2.284 | 0.022 | 0.064 | 0.835 |
So participation gives almost half a grade increase.
Probit#
Probit is very similar to logit: it’s a statistical method for a best-fit line between regressors \(X\) and an outcome varibale \(y\) that takes on values in \((0, 1)\). And, just like with logit, the function that we’re assuming links the regressors and the outcome has a few different names!
The data generating process is assumed to be
where
is the cumulative standard normal (aka Gaussian) distribution. The coefficients from a probit model do not have the same interpration as in an OLS estimation, and you can see this from the fact that \(\partial y/\partial x_i \neq \beta_i\) for probit. And, just as with logit, although you can derive the marginal effects, most packages offer a convenient way to quickly recover them.
We can re-use our previous example of predicting a target GRADE
, representing whether a grade improved or not, based on some regressors including participation (PSI) in a programme.
res_probit = smf.probit("GRADE ~ GPA + TUCE + PSI", data=df).fit()
print(res_probit.summary())
Optimization terminated successfully.
Current function value: 0.400588
Iterations 6
Probit Regression Results
==============================================================================
Dep. Variable: GRADE No. Observations: 32
Model: Probit Df Residuals: 28
Method: MLE Df Model: 3
Date: Wed, 01 Feb 2023 Pseudo R-squ.: 0.3775
Time: 00:42:51 Log-Likelihood: -12.819
converged: True LL-Null: -20.592
Covariance Type: nonrobust LLR p-value: 0.001405
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -7.4523 2.542 -2.931 0.003 -12.435 -2.469
GPA 1.6258 0.694 2.343 0.019 0.266 2.986
TUCE 0.0517 0.084 0.617 0.537 -0.113 0.216
PSI 1.4263 0.595 2.397 0.017 0.260 2.593
==============================================================================
p_marg_effect = res_probit.get_margeff(at="mean", method="dydx")
p_marg_effect.summary()
Dep. Variable: | GRADE |
---|---|
Method: | dydx |
At: | mean |
dy/dx | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
GPA | 0.5333 | 0.232 | 2.294 | 0.022 | 0.078 | 0.989 |
TUCE | 0.0170 | 0.027 | 0.626 | 0.531 | -0.036 | 0.070 |
PSI | 0.4679 | 0.188 | 2.494 | 0.013 | 0.100 | 0.836 |
It’s no coincidence that we find very similar results here because the two functions we’re using don’t actually look all that different:
import scipy.stats as st
fig, ax = plt.subplots()
support = np.linspace(-6, 6, 1000)
ax.plot(support, st.logistic.cdf(support), ls="--", label="Logistic")
ax.plot(support, st.norm.cdf(support), label="Probit")
ax.legend()
ax.set_ylim(0, None)
ax.set_ylim(0, None)
plt.show()
What difference there is, is that logistic regression puts more weight into the tails of the distribution. Arguably, logit is easier to interpret too. With logistic regression, a one unit change in \(x_i\) is associated with a \(\beta_i\) change in the log odds of a 1 outcome or, alternatively, an \(e^{\beta_i}\)-fold change in the odds, all else being equal. With a probit, this is a change of \(\beta_i z\) for \(z\) a normalised variable that you’d have to convert into a predicted probability using the normal CDF.
Generalised linear models#
Logit and probit (and OLS for that matter) as special cases of a class of models such that \(g\) is a ‘link’ function connects a function of regressors to the output, and \(\mu\) is the mean of a conditional response distribution at a given point in the space of regressors. When \(g(\mu) = X'\beta\), we just get regular OLS. When it’s logit, we have
But as well as the ones we’ve seen, there are many possible link functions one can use via the catch-all glm
function. These come in different ‘families’ of distributions, with the default for the binomial family being logit. So, running smf.glm('GRADE ~ GPA + TUCE + PSI', data=df, family=sm.families.Binomial()).fit()
will produce exactly the same as we got both using the logit
function. For more on the families of distributions and possible link functions, see the relevant part of the statsmodels documentation.
If you need a library dedicated to GLMs that has all the bells and whistles you can dream of, you might want to check out glum. At the time of writing, it is faster than either GLMnet or H2O (two other popular GLM libraries).
Linear Mixed Effects Regressions (LMER)#
Linear mixed models are an extension of simple linear models to allow both fixed and random effects.
We’ll be using the statsmodels package to demonstrate this, along with the dietox dataset, which has data on the growth curves of pigs. The data are taken from a study of the “Influence of Dietary Rapeseed Oli, Vitamin E, and Copper on Performance and Antioxidant and Oxidative Status of Pigs” Lauridsen et al. [1999].
data = sm.datasets.get_rdataset('dietox', 'geepack').data
data.head()
Pig | Evit | Cu | Litter | Start | Weight | Feed | Time | |
---|---|---|---|---|---|---|---|---|
0 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 26.50000 | NaN | 1 |
1 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 27.59999 | 5.200005 | 2 |
2 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 36.50000 | 17.600000 | 3 |
3 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 40.29999 | 28.500000 | 4 |
4 | 4601 | Evit000 | Cu000 | 1 | 26.5 | 49.09998 | 45.200001 | 5 |
Now we’re going to regress the weight of the pigs on the time and a couple of random effects.
We fit two random effects for each pig: a random intercept, and a random slope (with respect to time). This means that each pig may have a different baseline weight, as well as grow at a different rate.
md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"], re_formula="~Time")
mdf = md.fit(method=["lbfgs"])
mdf.summary()
Model: | MixedLM | Dependent Variable: | Weight |
No. Observations: | 861 | Method: | REML |
No. Groups: | 72 | Scale: | 6.0372 |
Min. group size: | 11 | Log-Likelihood: | -2217.0475 |
Max. group size: | 12 | Converged: | Yes |
Mean group size: | 12.0 |
Coef. | Std.Err. | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 15.739 | 0.550 | 28.603 | 0.000 | 14.660 | 16.817 |
Time | 6.939 | 0.080 | 86.925 | 0.000 | 6.783 | 7.095 |
Group Var | 19.503 | 1.561 | ||||
Group x Time Cov | 0.294 | 0.153 | ||||
Time Var | 0.416 | 0.033 |
This is saying that we have 72 pigs with 861 observations between them. The “Group Var” here is the random effect intercept for pigs, and it has a variance of 19.5. Meanwhile the slope random effect has a variance of 0.416. In addition to these, there are the two fixed effects. It’s easier to see where all of these estimates come out by plotting them. We’ll be honest, plotting the estimates hasn’t been made easy—so getting the data out and configuring it isn’t as easy as it could be.
First we pop the random effects results into a dataframe
group_time_ests = pd.DataFrame.from_dict(mdf.random_effects).T
group_time_ests["position_group"] = "Group"
group_time_ests["position_time"] = "Time"
group_time_ests["Group"] = group_time_ests["Group"] + mdf.fe_params["Intercept"]
group_time_ests["Time"] = group_time_ests["Time"] + mdf.fe_params["Time"]
group_time_ests.head()
Group | Time | position_group | position_time | |
---|---|---|---|---|
4601 | 15.109694 | 6.848209 | Group | Time |
4602 | 16.133875 | 7.364919 | Group | Time |
4603 | 17.846672 | 7.235199 | Group | Time |
4605 | 19.771567 | 7.374041 | Group | Time |
4641 | 17.265125 | 8.143678 | Group | Time |
Now let’s plot the data:
import seaborn.objects as so
details_dict = {"Intercept": 0, "Time": 1}
fig, ax = plt.subplots()
ax.set_title("Random effects: Pig model")
for key, value in details_dict.items():
ax.scatter(mdf.fe_params[key], value, s=50, color="black", edgecolor="k", zorder=3)
ax.errorbar(mdf.fe_params[key], value, xerr=mdf.conf_int().loc[key].diff().dropna(), ls='', lw=4, color="k")
sns_plot = (
so.Plot(group_time_ests, "Group", "position_group")
.add(so.Dots(), so.Jitter())
.add(so.Dots(), so.Jitter(), x="Time", y="position_time")
.label(x="Estimate", y="Value")
.on(ax)
.plot()
)
Linear probability model#
When \(y\) takes values in \(\{0, 1\}\) but the model looks like
and is estimated by OLS then you have a linear probability model. In this case, the interpretion of a unit change in \(x_i\) is that it induces a \(\beta_i\) change in probability of \(y\). Note that homoskedasticity does not hold for the linear probability model.
Violations of the classical linear model (CLM)#
Heteroskedasticity#
If an estimated model is homoskedastic then its random variables have equal (finite) variance. This is also known as homogeneity of variance. Another way of putting it is that, for all observations \(i\) in an estimated model \(y_i = X_i\hat{\beta} + \epsilon_i\) then
When this relationship does not hold, an estimated model is said to be heteroskedastic.
To test for heteroskedasticity, you can use statsmodels’ versions of the Breusch-Pagan or White tests with the null hypothesis that the estimated model is homoskedastic. If the null hypothesis is rejected, then standard errors, t-statistics, and F-statistics are invalidated. In this case, you will need HAC (heteroskedasticity and auto-correlation consistent) standard errors, t- and F-statistics.
To obtain HAC standard errors from existing regression results in a variable results
, you can use (for 1 lag):
results.get_robustcov_results('HAC', maxlags=1).summary()
Quantile regression#
Quantile regression estimates the conditional quantiles of a response variable. In some cases, it can be more robust to outliers and, in the case of the \(q=0.5\) quantile it is equivalent LAD (Least Absolute Deviation) regression. Let’s look at an example of quantile regression in action, lifted direct from the statsmodels documentation and based on a Journal of Economic Perspectives paper by Koenker and Hallock.
df = sm.datasets.engel.load_pandas().data
df.head()
income | foodexp | |
---|---|---|
0 | 420.157651 | 255.839425 |
1 | 541.411707 | 310.958667 |
2 | 901.157457 | 485.680014 |
3 | 639.080229 | 402.997356 |
4 | 750.875606 | 495.560775 |
What we have here are two sets of related data. Let’s perform several quantile regressions from 0.1 to 0.9 in steps of 0.1
mod = smf.quantreg("foodexp ~ income", df)
quantiles = np.arange(0.1, 1.0, 0.1)
q_results = [mod.fit(q=x) for x in quantiles]
The \(q=0.5\) entry will be at the 4
index; let’s take a look at it:
print(q_results[4].summary())
QuantReg Regression Results
==============================================================================
Dep. Variable: foodexp Pseudo R-squared: 0.6206
Model: QuantReg Bandwidth: 64.51
Method: Least Squares Sparsity: 209.3
Date: Wed, 01 Feb 2023 No. Observations: 235
Time: 00:42:53 Df Residuals: 233
Df Model: 1
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 81.4823 14.634 5.568 0.000 52.649 110.315
income 0.5602 0.013 42.516 0.000 0.534 0.586
==============================================================================
The condition number is large, 2.38e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Let’s take a look at the results for all of the regressions and let’s add in OLS for comparison:
ols_res = smf.ols("foodexp ~ income", df).fit()
get_y = lambda a, b: a + b * x
x = np.arange(df.income.min(), df.income.max(), 50)
# Just to make the plot clearer
x_max = 3000
x = x[x < x_max]
fig, ax = plt.subplots(figsize=(8, 6))
df.plot.scatter(
ax=ax, x="income", y="foodexp", alpha=0.7, s=10, zorder=2, edgecolor=None
)
for i, res in enumerate(q_results):
y = get_y(res.params["Intercept"], res.params["income"])
ax.plot(x, y, color="grey", lw=0.5, zorder=0, linestyle=(0, (5, 10)))
ax.annotate(f"$q={quantiles[i]:1.1f}$", xy=(x.max(), y.max()))
y = get_y(ols_res.params["Intercept"], ols_res.params["income"])
ax.plot(x, y, color="red", label="OLS", zorder=0)
ax.legend()
ax.set_xlim(0, x_max)
plt.show()
This chart shows very clearly how quantile regression differs from OLS. The line fitted by OLS is trying to be all things to all points whereas the line fitted by quantile regression is focused only on its quantile. You can also see how points far from the median (not all shown) may be having a large influence on the OLS line.
Rolling and recursive regressions#
Rolling ordinary least squares applies OLS (ordinary least squares) across a fixed window of observations and then rolls (moves or slides) that window across the data set. They key parameter is window
, which determines the number of observations used in each OLS regression. Recursive regression is equivalent to rolling regression but with a window that expands over time.
Let’s first create some synthetic data to perform estimation on:
from statsmodels.regression.rolling import RollingOLS
import statsmodels.api as sm
from sklearn.datasets import make_regression
X, y = make_regression(n_samples=200, n_features=2, random_state=0, noise=4.0, bias=0)
df = pd.DataFrame(X).rename(columns={0: "feature0", 1: "feature1"})
df["target"] = y
df.head()
feature0 | feature1 | target | |
---|---|---|---|
0 | -0.955945 | -0.345982 | -36.740556 |
1 | -1.225436 | 0.844363 | 7.190031 |
2 | -0.692050 | 1.536377 | 44.389018 |
3 | 0.010500 | 1.785870 | 57.019515 |
4 | -0.895467 | 0.386902 | -16.088554 |
Now let’s fit the model using a formula and a window
of 25 steps.
roll_reg = RollingOLS.from_formula(
"target ~ feature0 + feature1 -1", window=25, data=df
)
model = roll_reg.fit()
Note that -1 in the formala suppresses the intercept. We can see the parameters using model.params
. Here are the params for time steps between 20 and 30:
model.params[20:30]
feature0 | feature1 | |
---|---|---|
20 | NaN | NaN |
21 | NaN | NaN |
22 | NaN | NaN |
... | ... | ... |
27 | 20.532655 | 34.919468 |
28 | 20.470171 | 35.365235 |
29 | 20.002261 | 35.666997 |
10 rows × 2 columns
Note that there aren’t parameters for entries between 0 and 23 because our window is 25 steps wide. We can easily look at how any of the coefficients are changing over time. Here’s an example for ‘feature0’.
fig = model.plot_recursive_coefficient(variables=["feature0"])
plt.xlabel("Time step")
plt.ylabel("Coefficient value")
plt.show()
A rolling regression with an expanding rather than moving window is effectively a recursive least squares model. We can do this instead using the RecursiveLS
function from statsmodels. Let’s fit this to the whole dataset:
reg_rls = sm.RecursiveLS.from_formula("target ~ feature0 + feature1 -1", df)
model_rls = reg_rls.fit()
print(model_rls.summary())
Statespace Model Results
==============================================================================
Dep. Variable: target No. Observations: 200
Model: RecursiveLS Log Likelihood -570.923
Date: Wed, 01 Feb 2023 R-squared: 0.988
Time: 00:42:54 AIC 1145.847
Sample: 0 BIC 1152.444
- 200 HQIC 1148.516
Covariance Type: nonrobust Scale 17.413
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
feature0 20.6872 0.296 69.927 0.000 20.107 21.267
feature1 34.0655 0.302 112.870 0.000 33.474 34.657
===================================================================================
Ljung-Box (L1) (Q): 2.02 Jarque-Bera (JB): 3.93
Prob(Q): 0.16 Prob(JB): 0.14
Heteroskedasticity (H): 1.17 Skew: -0.31
Prob(H) (two-sided): 0.51 Kurtosis: 3.31
===================================================================================
Warnings:
[1] Parameters and covariance matrix estimates are RLS estimates conditional on the entire sample.
But now we can look back at how the values of the coefficients changed over time too:
fig = model_rls.plot_recursive_coefficient(
range(reg_rls.k_exog), legend_loc="upper right"
)
ax_list = fig.axes
for ax in ax_list:
ax.set_xlim(0, None)
ax_list[-1].set_xlabel("Time step")
ax_list[0].set_title("Coefficient value");
Regression plots#
statsmodels has a number of built-in plotting methods to help you understand how well your regression is capturing the relationships you’re looking for. Let’s see a few examples of these using statsmodels built-in Statewide Crime data set:
crime_data = sm.datasets.statecrime.load_pandas()
print(sm.datasets.statecrime.NOTE)
::
Number of observations: 51
Number of variables: 8
Variable name definitions:
state
All 50 states plus DC.
violent
Rate of violent crimes / 100,000 population. Includes murder, forcible
rape, robbery, and aggravated assault. Numbers for Illinois and
Minnesota do not include forcible rapes. Footnote included with the
American Statistical Abstract table reads:
"The data collection methodology for the offense of forcible
rape used by the Illinois and the Minnesota state Uniform Crime
Reporting (UCR) Programs (with the exception of Rockford, Illinois,
and Minneapolis and St. Paul, Minnesota) does not comply with
national UCR guidelines. Consequently, their state figures for
forcible rape and violent crime (of which forcible rape is a part)
are not published in this table."
murder
Rate of murders / 100,000 population.
hs_grad
Percent of population having graduated from high school or higher.
poverty
% of individuals below the poverty line
white
Percent of population that is one race - white only. From 2009 American
Community Survey
single
Calculated from 2009 1-year American Community Survey obtained obtained
from Census. Variable is Male householder, no wife present, family
household combined with Female householder, no husband present, family
household, divided by the total number of Family households.
urban
% of population in Urbanized Areas as of 2010 Census. Urbanized
Areas are area of 50,000 or more people.
First, let’s look at a Q-Q plot to get a sense of how the variables are distributed. This uses scipy’s stats module. The default distribution is normal but you can use any that scipy supports.
st.probplot(crime_data.data["murder"], dist="norm", plot=plt);
Clearly, this is not quite normal and there are some serious outliers in the tails.
Let’s run take a look at the unconditional relationship we’re interested in: how murder depends on high school graduation. We’ll use plotnine’s geom_smooth
to do this but bear in mind it will only run a linear model of 'murder ~ hs_grad'
and ignore the other covariates.
from plotnine import *
(
ggplot(crime_data.data, aes(y="murder", x="hs_grad"))
+ geom_point()
+ geom_smooth(method="lm")
)
<ggplot: (8770281830876)>
We can take into account those other factors by using a partial regression plot that asks what does \(\mathbb{E}(y|X)\) look like as a function of \(\mathbb{E}(x_i|X)\)? (Use obs_labels=False
to remove data point labels.)
with plt.rc_context({"font.size": 5}):
sm.graphics.plot_partregress(
endog="murder",
exog_i="hs_grad",
exog_others=["urban", "poverty", "single"],
data=crime_data.data,
obs_labels=True,
)
plt.show()
eval_env: 1
At this point, the results of the regression are useful context.
results_crime = smf.ols(
"murder ~ hs_grad + urban + poverty + single", data=crime_data.data
).fit()
print(results_crime.summary())
OLS Regression Results
==============================================================================
Dep. Variable: murder R-squared: 0.813
Model: OLS Adj. R-squared: 0.797
Method: Least Squares F-statistic: 50.08
Date: Wed, 01 Feb 2023 Prob (F-statistic): 3.42e-16
Time: 00:42:56 Log-Likelihood: -95.050
No. Observations: 51 AIC: 200.1
Df Residuals: 46 BIC: 209.8
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -44.1024 12.086 -3.649 0.001 -68.430 -19.774
hs_grad 0.3059 0.117 2.611 0.012 0.070 0.542
urban 0.0109 0.015 0.707 0.483 -0.020 0.042
poverty 0.4121 0.140 2.939 0.005 0.130 0.694
single 0.6374 0.070 9.065 0.000 0.496 0.779
==============================================================================
Omnibus: 1.618 Durbin-Watson: 2.507
Prob(Omnibus): 0.445 Jarque-Bera (JB): 0.831
Skew: -0.220 Prob(JB): 0.660
Kurtosis: 3.445 Cond. No. 5.80e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Putting the multicollinearity problems to one side, we see that the relationship shown in the partial regression plot is also implied by the coefficient on hs_grad
in the regression table.
We can also look at an in-depth summary of one exogenous regressor and its relationship to the outcome variable. Each of these types of regression diagnostic are available individually, or for all regressors at once, too. The first panel is the chart we did with plotnine rendered differently (and, one could argue, more informatively). Most of the plots below are self-explanatory except for the third one, the CCPR (Component-Component plus Residual) plot. This provides a way to judge the effect of one regressor on the response variable by taking into account the effects of the other independent variables.
fig = plt.figure(figsize=(8, 6), dpi=150)
sm.graphics.plot_regress_exog(results_crime, "hs_grad", fig=fig)
plt.tight_layout()
plt.show()
eval_env: 1
statsmodels can also produce influence plots of the ‘externally studentised’ residuals vs. the leverage of each observation as measured by the so-called hat matrix \(X(X^{\;\prime}X)^{-1}X^{\;\prime}\) (because it puts the ‘hat’ on \(y\)). Externally studentised residuals are residuals that are scaled by their standard deviation. High leverage points could exert an undue influence over the regression line, but only if the predicted \(y\) values of a regression that was fit with them excluded was quite different. In the example below, DC is having a big influence.
with plt.rc_context({"font.size": 6}):
sm.graphics.influence_plot(results_crime)
Finally, it’s nice to be able to see plots of our coefficients along with their standard errors. There isn’t a built-in statsmodels option for this, but happily it’s easy to extract the results of regressions in a sensible format. Using the results
object from earlier, and excluding the intercept, we can get the coefficients from results.params[1:]
and the associated errors from results.bse[1:]
.
# Put the results into a dataframe with Name, Coefficient, Error
res_df = (
pd.concat([results_crime.params[1:], results_crime.bse[1:]], axis=1)
.reset_index()
.rename(columns={"index": "Name", 0: "Coefficient", 1: "Error"})
)
# Plot the coefficient values and their errors
(
ggplot(res_df)
+ geom_point(aes("Name", "Coefficient"))
+ geom_errorbar(aes(x="Name", ymin="Coefficient-Error", ymax="Coefficient+Error"))
)
<ggplot: (8770281952485)>
Binscatter#
We’re going to follow the excellent example in this blog post by Matteo Courthoud.
First we’ll generate some fake data:
def dgp_marketplace(N=10_000):
# Does the firm sells only online?
online = prng.binomial(1, 0.5, N)
# How many products does the firm have
products = 1 + prng.poisson(1, N)
# What is the age of the firm
t = prng.exponential(0.5*products, N)
# Sales
sales = 1e3 * prng.exponential(products + np.maximum((1 + 0.3*products + 4*online)*t - 0.5*(1 + 6*online)*t**2, 0), N)
# Generate the dataframe
df = pd.DataFrame({'age': t, 'sales': sales, 'online': online, 'products': products})
return df
sales = dgp_marketplace(N=10_000)
sales.head()
age | sales | online | products | |
---|---|---|---|---|
0 | 2.507089 | 1383.896707 | 0 | 2 |
1 | 0.074963 | 2738.994208 | 1 | 2 |
2 | 0.259299 | 10876.933496 | 0 | 6 |
3 | 1.626234 | 12159.344328 | 0 | 2 |
4 | 0.026536 | 1584.629056 | 0 | 3 |
We have pulled down information on 10,000 firms. For each firm we know:
age
: the age of the firmsales
: the monthly sales from last monthonline
: whether the firm is only active onlineproducts
: the number of products that the firm offers
Suppose we are interested in understanding the relationship between age and sales. What is the life-cycle of sales?
Let’s start with a simple scatterplot of sales vs age.
import seaborn.objects as so
(
so.Plot(sales, x="age", y="sales")
.add(so.Dot(edgecolor="k"))
)
This plot is far too crowded to get a real sense of the relationship here.
A linear regression can help us unpick what the relationship really is
smf.ols('np.log(sales) ~ np.log(age)', sales).fit().summary().tables[1]
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 7.3964 | 0.015 | 488.757 | 0.000 | 7.367 | 7.426 |
np.log(age) | 0.1503 | 0.010 | 15.588 | 0.000 | 0.131 | 0.169 |
There is a pretty strong positive relationship here that wasn’t evident in the scatter plot. BUT it’s possible that this relationships depends on whether the firms are online-only or not. We can condition on the other variables and do another regression:
smf.ols('np.log(sales) ~ np.log(age) + C(online) + products', sales).fit().summary().tables[1]
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 6.5413 | 0.036 | 181.200 | 0.000 | 6.471 | 6.612 |
C(online)[T.1] | 0.1798 | 0.026 | 6.888 | 0.000 | 0.129 | 0.231 |
np.log(age) | 0.0642 | 0.010 | 6.456 | 0.000 | 0.045 | 0.084 |
products | 0.3521 | 0.014 | 25.191 | 0.000 | 0.325 | 0.379 |
The coefficient now looks very different having conditioned on the other confounders. We can see that sales still increase with age, but we don’t know if this relationship is linear or not. We could add extra terms (like age squared) to explore this.
Or, we could use a binned scatterplot to analyse the relationship non-parametrically. The binned scatterplot divides the conditioning variable, age
in our example, into equally sized bins or quantiles, and then plots the conditional mean of the dependent variable, sales
in our example, within each bin.
We’ll be using the binsreg package for this Cattaneo et al. [2019].
Binned scatterplots do not just provide conditional means for optimally chosen intervals; they can also provide inference for these means. In particular, they will give confidence intervals around each data point. In the binsreg package, the ci
option adds confidence intervals to the estimation results.
As the binsreg package is not written in the most Pythonic way, we’ll adorn it a bit to make it more easily used with dataframes.
import binsreg
def binscatter(**kwargs):
# Estimate binsreg
est = binsreg.binsreg(**kwargs)
# Retrieve estimates
df_est = pd.concat([d.dots for d in est.data_plot])
df_est = df_est.rename(columns={'x': kwargs.get("x"), 'fit': kwargs.get("y")})
# Add confidence intervals
if "ci" in kwargs:
df_est = pd.merge(df_est, pd.concat([d.ci for d in est.data_plot]))
df_est = df_est.drop(columns=['x'])
df_est['ci'] = df_est['ci_r'] - df_est['ci_l']
# Rename groups
if "by" in kwargs:
df_est['group'] = df_est['group'].astype(kwargs.get("data")[kwargs.get("by")].dtype)
df_est = df_est.rename(columns={'group': kwargs.get("by")})
return df_est
# Estimate binsreg
br_est = binscatter(x='age', y='sales', data=sales, ci=(3,3))
br_est.head()
group | age | bin | isknot | mid | sales | ci_l | ci_r | ci | |
---|---|---|---|---|---|---|---|---|---|
0 | Full Sample | 0.010859 | 0 | 0 | 0 | 1464.787623 | 1071.515840 | 1583.280880 | 511.765040 |
1 | Full Sample | 0.033589 | 1 | 0 | 0 | 1796.147300 | 1597.311734 | 2008.197828 | 410.886094 |
2 | Full Sample | 0.059490 | 2 | 0 | 0 | 1859.670036 | 1671.846295 | 2109.052975 | 437.206680 |
3 | Full Sample | 0.086708 | 3 | 0 | 0 | 1886.426934 | 1669.795790 | 2095.450639 | 425.654848 |
4 | Full Sample | 0.114412 | 4 | 0 | 0 | 1938.436076 | 1709.891878 | 2248.915920 | 539.024041 |
We can now plot the results
fig, ax = plt.subplots()
sns_plot = (
so.Plot(br_est, x="age", y="sales")
.add(so.Dots())
)
ax.errorbar('age', 'sales', yerr='ci', data=br_est, ls='', lw=2, alpha=0.2)
ax.set_title("Binscatter: sales as a function of age")
sns_plot.on(ax).show()
We can now see that the relationship is non-linear with a sharp increase in sales
at the beginning of the lifetime of a firm, followed by a plateau.
As noted, this relationship may be modified by other variables though. So let’s now condition on those.
To condition on products
, we pass w=["products"]
to the binscatter
function.
# Estimate binsreg
br_est = binscatter(x='age', y='sales', w=['products'], data=sales, ci=(3,3))
br_est.head()
group | age | bin | isknot | mid | sales | ci_l | ci_r | ci | |
---|---|---|---|---|---|---|---|---|---|
0 | Full Sample | 0.011594 | 0 | 0 | 0 | 1993.431855 | 1696.328657 | 2194.308173 | 497.979516 |
1 | Full Sample | 0.036091 | 1 | 0 | 0 | 2205.494696 | 2020.644696 | 2413.209800 | 392.565104 |
2 | Full Sample | 0.063776 | 2 | 0 | 0 | 2456.340262 | 2223.387911 | 2629.754210 | 406.366299 |
3 | Full Sample | 0.093066 | 3 | 0 | 0 | 2346.037534 | 2173.715566 | 2574.592536 | 400.876970 |
4 | Full Sample | 0.122806 | 4 | 0 | 0 | 2425.527896 | 2160.148380 | 2686.659801 | 526.511421 |
Now let’s look at the plot of this:
fig, ax = plt.subplots()
sns_plot = (
so.Plot(br_est, x="age", y="sales")
.add(so.Dots())
)
ax.errorbar('age', 'sales', yerr='ci', data=br_est, ls='', lw=2, alpha=0.2)
ax.set_title("Binscatter: sales as a function of age conditioned on products")
sns_plot.on(ax).show()
Conditional on the number of products
, the shape of the sales
life-cycle changes further. Now, after an initial increase in sales
, we observe a gradual decrease over time.
Do online-only firms have different sales
life-cycles with respect to mixed online-offline firms? We can produce different binned scatterplots by group using the option by
.
# Estimate binsreg
br_est = binscatter(x='age', y='sales', w=['products'], by="online", data=sales, ci=(3,3))
br_est["online"] = br_est["online"].astype("category")
# plot the two groups
fig, ax = plt.subplots(figsize=(7, 5))
sns_plot = (
so.Plot(br_est, x="age", y="sales", color="online")
.add(so.Dots(pointsize=7))
)
for value in br_est["online"].unique():
ax.errorbar('age', 'sales', yerr='ci', data=br_est.loc[br_est["online"]==value], ls='', lw=2, alpha=0.2)
ax.set_title("Binscatter: sales as a function of age conditioned on products", size=16)
sns_plot.on(ax).show()
From the binned scatterplot, we can see that online products have on average shorter lifetimes, with a higher initial peak in sales, followed by a sharper decline.
Hopefully this shows you how you can use binsreg to get a better understanding of causal relationships in your data.
You can find out more about bin scatters in this video by Paul Goldsmith-Pinkham.
Specification curve analysis#
When specifying a model, modellers have many options. These can be informed by field intelligence, priors, and even misguided attempts to find a significant result. Even with the best of intentions, research teams can reach entirely different conclusions using the same, or similar, data because of different choices made in preparing data or in modelling it.
There’s formal evidence that researchers really do make different decisions; one study Silberzahn et al. [2018] gave the same research question - whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players - to 29 different teams. From the abstract of that paper:
Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability.
So not only were different decisions made, there seems to be no clearly identifiable reason for them. There is usually scope for reasonable alternative model specifications when estimating coefficients, and those coefficients will vary with those specifications.
Specification curve analysis Simonsohn et al. [2020] looks for a more exhaustive way of trying out alternative specifications. The three steps of specification curve analysis are:
identifying the set of theoretically justified, statistically valid, and non-redundant analytic specifications;
displaying alternative results graphically, allowing the identification of decisions producing different results; and
conducting statistical tests to determine whether as a whole results are inconsistent with the null hypothesis.
For a good example of specification curve analysis in action, see this recent Nature Human Behaviour paper Orben and Przybylski [2019] on the association between adolescent well-being and the use of digital technology.
We’ll use the specification curve analysis package to do the first two, which you can install with pip install specification_curve
. To demonstrate the full functionality, we’ll create a second, alternative ‘hp’ that is a transformed version of the original.
mpg["hp_boxcox"], _ = st.boxcox(mpg["hp"])
Now let’s create a specification curve. We need to specify the data, the different outcome variables we’d like to try, y_endog
; the different possible versions of the main regressor of interest, x_exog
; the possible controls, controls
; any controls that should always be included, always_include
; and any categorical variables to include class-by-class, cat_expand
. Some of these accept lists of variables as well as single reggressors. The point estimates that have confidence intervals which include zero are coloured in grey, instead of blue. There is also an exclu_grps
option to exclude certain combinations of regressors, and you can pass alternative estimators to fit, for example fit(estimator=sm.Logit)
.
import specification_curve as specy
sc = specy.SpecificationCurve(
mpg,
y_endog="mpg",
x_exog=["lnhp", "hp_boxcox"],
controls=["drat", "qsec", "cyl", "gear"],
always_include=["gear"],
cat_expand="cyl",
)
sc.fit()
sc.plot()
Fit complete
Review#
In this very short introduction to regression with code, you should have learned how to:
✅ perform linear OLS regressions with code;
✅ add fixed effects/categorical variables to regressions;
✅ use different standard errors;
✅ use models with transformed regressors;
✅ use the formula or array APIs for statsmodels and linearmodels;
✅ show the results from multiple models;
✅ perform IV regressions;
✅ perform GLM regressions; and
✅ use plots as a way to interrogate regression results.