Regression#

Introduction#

In this chapter, you’ll learn how to run linear regressions with code.

If you’re running this code (either by copying and pasting it, or by downloading it using the icons at the top of the page), you may need to install the packages it uses first. There’s a brief guide to installing packages in the Chapter on Preliminaries.

Most of this chapter will rely on statsmodels with some use of linearmodels. Some of the material in this chapter follows Grant McDermott’s excellent notes and the Library of Statistical Translation.

Quick Review of Econometrics#

Notation#

Greek letters, like \(\beta\), are the truth and represent parameters. Modified Greek letters are an estimate of the truth, for example \(\hat{\beta}\). Sometimes Greek letters will stand in for vectors of parameters. Most of the time, upper case Latin characters such as \(X\) will represent random variables (which could have more than one dimension). Lower case letters from the Latin alphabet denote realised data, for instance \(x\) (which again could be multi-dimensional). Modified Latin alphabet letters denote computations performed on data, for instance \(\bar{x} = \frac{1}{n} \displaystyle\sum_{i} x_i\) where \(n\) is number of samples.

Ordinary least squares (OLS) regression can be used to estimate the parameters of certain types of model, most typically models of the form

\[ y = \beta_0 + \beta_1 \cdot x_1 + \beta_2 \cdot x_2 \]

This generic model says that the value of an outcome variable \(y\) is a linear function of one or more input predictor variables \(x_i\), where the \(x_i\) could be transforms of original data. But the above equation is a platonic ideal, what we call a data generating process (DGP). OLS allows us to recover estimates of the parameters of the model , i.e. to find \(\hat{\beta_i}\) and to enable us to write an estimated model:

\[ y = \hat{\beta_0} + \hat{\beta_1} \cdot x_1 + \hat{\beta_2} \cdot x_2 + \epsilon \]

This equation can also be expressed in matrix form as

\[ y = x'\cdot \hat{\beta} + \epsilon \]

where \(x' = (1, x_1, \dots, x_{n})'\) and \(\hat{\beta} = (\hat{\beta_0}, \hat{\beta_1}, \dots, \hat{\beta_{n}})\).

Given data \(y_i\) stacked to make a vector \(y\) and \(x_{i}\) stacked to make a matrix \(X\), this can be solved for the coefficients \(\hat{\beta}\) according to

\[ \hat{\beta} = \left(X'X\right)^{-1} X'y \]

Gauss-Markov Conditions#

To be sure that the estimates of these parameters are the best linear unbiased estimate, a few conditions need to hold: the Gauss-Markov conditions:

  1. \(y\) is a linear function of the \(\beta_i\)

  2. \(y\) and the \(x_i\) are randomly sampled from the population.

  3. There is no perfect multi-collinearity of variables.

  4. \(\mathbb{E}(\epsilon | x_1, \dots, x_n) = 0\) (unconfoundedness)

  5. \(\text{Var}(\epsilon | x_1, \dots, x_n) = \sigma^2\) (homoskedasticity)

(1)-(4) also guarantee that OLS estimates are unbiased and \(\mathbb{E}(\hat{\beta}_i) = \beta_i\).

The classic linear model requires a 6th assumption; that \(\epsilon \thicksim \mathcal{N}(0, \sigma^2)\).

The interpretation of regression coefficients depends on what their units are to begin with, but you can always work it out by differentiating both sides of the model equation with respect to the \(x_i\). For example, for the first model equation above

\[ \frac{\partial y}{\partial x_i} = \beta_i \]

so we get the interpretation that \(\beta_i\) is the rate of change of y with respect to \(x_i\). If \(x_i\) and \(y\) are in levels, this means that a unit increase in \(x_i\) is associated with a \(\beta_i\) units increase in \(y\). If the right-hand side of the model is \(\ln x_i\) then we get

\[ \frac{\partial y}{\partial x_i} = \beta_i \frac{1}{x_i} \]

with some abuse of notation, we can rewrite this as \(\partial y = \beta_i \partial x_i/x_i\), which says that a percent change in \(x_i\) is associated with a \(\beta_i\) unit change in \(y\). With a logged \(y\) variable, it’s a percent change in \(x_i\) that is associated with a percent change in \(y\), or \(\partial y/y = \beta_i \partial x_i/x_i\) (note that both sides of this equation are unitless in this case). Finally, another example that is important in practice is that of log differences, eg \(y = \beta_i (\ln x_i - \ln x_i')\). Again, we will abuse notation and say that this case may be represented as \(\partial y = \beta_i (\partial x_i/x_i - \partial x_i'/x_i')\), i.e. the difference in two percentages, a percentage point change, in \(x_i\) is associated with a \(\beta_i\) unit change in \(y\).

Fixed vs Random Effects#

As this is going to come up in this chapter, it’s worth having a quick review of the difference between fixed and random effects in regression-based models. There’s a bit of disagreement about what people really mean when they say “fixed effects” or “random effects”, so let’s define them clearly here.

Random effects are estimated with partial pooling, while fixed effects are not.

What does this mean? Well, technically, if an effect is assumed to be a realised value of a random variable, it is called a random effect. We could say that, in frequentist land (where parameters have true values), a single fixed effect model might look like:

\[ y = \gamma\cdot d \]

where \(d \in \{0, 1\}\), which could be whether an individual belongs to, say, group 0 or group 1 and \(\gamma\) is a coefficient that is to be estimated. \(\gamma\) has a single “true” value, and we can admire the value of \(\hat{\gamma}\) in a regression table once we have done our estimation.

In contrast, even in a frequentist model, random effects are allowed to vary: they are drawn from a distribution (or, more accurately, their deviations from the true value are drawn from a distribution). This might be written as

\[ y_{i}= w_{i} \]

and it could be that, instead of a group-specific intercept as in the fixed effects model above, the intercept can take a range of values depending on the individual. Although it’s not always true, another way of thinking about this is that fixed effects are constant across individuals while random effects can vary.

So what’s this business about partial pooling? Imagine we have a situation where one group is under-sampled relative to the other groups. With few data points, we may not be able to precisely estimate fixed effects. However, partial pooling means that, if you do have few data points within a group, the group effect estimate can be partially based on the more abundant data from other groups. Random effects can provide a good compromise between estimating an effect by completely pooling all groups, which masks group-level variation, and estimating an effect for all groups completely separately, which could give imprecise estimates for low-sample groups. For example, when estimating means using a random effects approach, subgroup means are allowed to deviate a bit from the mean of the larger group, but not by an arbitrary amount—those deviations follow a distribution, typically a Gaussian (aka Normal) distribution. That’s where the “random” in random effects comes from: the deviations of subgroups from a “parent group” follow the distribution of a random variable.

The rule of thumb when thinking about how many different categories should exist for a variable to get the random, rather than fixed, effects treatment is that there should be more than 5 or 6 levels at a minimum in order to apply random effects—fewer than this and it’s hard to estimate the variance around the central estimate. Because you could potentially precisely estimated a fixed effect if you had lots of samples in those 5 or 6 levels, you’re more likely to reach for random effects when you have fewer samples per class.

Random effects are the extension of this partial pooling technique to a wide range of situations: including multiple predictors, mixed continuous and categorical variables, and more.

As a conrete example, taken from a forum post by Paul of MassMutual, suppose you want to estimate average US household income by ZIP code. You have a large dataset containing observations of household incomes and ZIP codes. Some ZIP codes are well represented in the dataset, but others have only a couple of households.

For your initial model you would most likely take the mean income in each ZIP. This will work well when you have lots of data for all ZIPs, but the estimates for your poorly sampled ZIPs will suffer from high variance. You can mitigate this by using a shrinkage estimator (aka partial pooling), which will push extreme values towards the mean income across all ZIP codes.

But how much shrinkage/pooling should you do for a particular ZIP? Intuitively, it should depend on the following:

  1. How many observations you have in that ZIP

  2. How many observations you have overall

  3. The individual-level mean and variance of household income across all ZIP codes

  4. The group-level variance in mean household income across all ZIP codes

Now, if you model ZIP code as a random effect, the mean income estimate in all ZIP codes will be subjected to a statistically well-founded shrinkage, taking into account all of these factors.

Random effects models automatically handle 4, the variability estimation, for all random effects in the model. This is useful, as it would be hard to do manually. Having accounted for (1)–(4), a random effects model is able to determine the appropriate shrinkage for low-sample groups. It can also handle much more complicated models with many different predictors.

Mixed effects models are models that combine random and fixed effects.

Imports#

Let’s import some of the packages we’ll be using:

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
import statsmodels.formula.api as smf
import os
from pathlib import Path

Regression basics#

There are two ways to run regressions in statsmodels; passing the data directly as objects, and using formulae. We’ll see both but, just to get things started, let’s use the formula API.

We’ll use the starwars dataset to run a regression of mass on height for star wars characters. This example borrows very heavily from notes by Grant McDermott. First, let’s bring the dataset in:

df = pd.read_csv(
    "https://github.com/aeturrell/coding-for-economists/raw/main/data/starwars.csv",
    index_col=0,
)
# Look at first few rows
df.head()
name height mass hair_color eye_color gender homeworld species
0 Luke Skywalker 172.0 77.0 blond blue male Tatooine Human
1 C-3PO 167.0 75.0 NaN yellow NaN Tatooine Droid
2 R2-D2 96.0 32.0 NaN red NaN Naboo Droid
3 Darth Vader 202.0 136.0 none yellow male Tatooine Human
4 Leia Organa 150.0 49.0 brown brown female Alderaan Human

Okay, now let’s do a regression using OLS and a formula that says our y-variable is mass and our regressor is height:

results = smf.ols("mass ~ height", data=df).fit()

Well, where are the results!? They’re stored in the object we created. To peek at them we need to call the summary function (and, for easy reading, I’ll print it out too using print)

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   mass   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     1.040
Date:                Wed, 02 Nov 2022   Prob (F-statistic):              0.312
Time:                        23:48:36   Log-Likelihood:                -385.50
No. Observations:                  59   AIC:                             775.0
Df Residuals:                      57   BIC:                             779.2
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -13.8103    111.155     -0.124      0.902    -236.393     208.773
height         0.6386      0.626      1.020      0.312      -0.615       1.892
==============================================================================
Omnibus:                      128.880   Durbin-Watson:                   2.025
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7330.447
Skew:                           7.340   Prob(JB):                         0.00
Kurtosis:                      55.596   Cond. No.                         895.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

What we’re seeing here are really several tables glued together. To just grab the coefficients in a tidy format, use

results.summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept -13.8103 111.155 -0.124 0.902 -236.393 208.773
height 0.6386 0.626 1.020 0.312 -0.615 1.892

You’ll have noticed that we got an intercept, even though we didn’t specify one in the formula. statsmodels adds in an intercept by default because, most of the time, you will want one. To turn it off, add a -1 at the end of the formula command, eg in this case you would call smf.ols('mass ~ height -1', data=df).fit().

The fit we got in the case with the intercept was pretty terrible; a low \(R^2\) and both of our confidence intervals are large and contain zero. What’s going on? If there’s one adage in regression that’s always worth paying attention to, it’s always plot your data. Let’s see what’s going on here:

fig, ax = plt.subplots()
sns.scatterplot(data=df, x="height", y="mass", s=200, ax=ax, legend=False, alpha=0.8)
ax.annotate(
    "Jabba the Hutt",
    df.iloc[df["mass"].idxmax()][["height", "mass"]],
    xytext=(0, -50),
    textcoords="offset points",
    arrowprops=dict(
        arrowstyle="fancy",
        color="k",
        connectionstyle="arc3,rad=0.3",
    ),
)
ax.set_ylim(0, None)
ax.set_title("Always plot the data", loc="left")
plt.show()
_images/econmt-regression_15_0.svg

Oh dear, Jabba’s been on the paddy frogs again, and he’s a bit of different case. When we’re estimating statistical relationships, we have all kinds of choices and should be wary about arbitrary decisions of what to include or exclude in case we fool ourselves about the generality of the relationship we are capturing. Let’s say we knew that we weren’t interested in Hutts though, but only in other species: in that case, it’s fair enough to filter out Jabba and run the regression without this obvious outlier. We’ll exclude any entry that contains the string ‘Jabba’ in the name column:

results_outlier_free = smf.ols(
    "mass ~ height", data=df[~df["name"].str.contains("Jabba")]
).fit()
print(results_outlier_free.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   mass   R-squared:                       0.580
Model:                            OLS   Adj. R-squared:                  0.572
Method:                 Least Squares   F-statistic:                     77.18
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           4.02e-12
Time:                        23:48:36   Log-Likelihood:                -252.48
No. Observations:                  58   AIC:                             509.0
Df Residuals:                      56   BIC:                             513.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -32.5408     12.561     -2.591      0.012     -57.703      -7.379
height         0.6214      0.071      8.785      0.000       0.480       0.763
==============================================================================
Omnibus:                        9.951   Durbin-Watson:                   1.657
Prob(Omnibus):                  0.007   Jarque-Bera (JB):               10.072
Skew:                           0.793   Prob(JB):                      0.00650
Kurtosis:                       4.285   Cond. No.                         888.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This looks a lot more healthy. Not only is the model explaining a lot more of the data, but the coefficients are now significant.

Robust regression#

Filtering out data is one way to deal with outliers, but it’s not the only one; an alternative is to use a regression technique that is robust to such outliers. statsmodels has a variety of robust linear models that you can read more about here. To demonstrate the general idea, we will run the regression again but using a robust method.

results_robust = smf.rlm(
    "mass ~ height", data=df, M=sm.robust.norms.TrimmedMean(0.5)
).fit()
print(results_robust.summary())
                    Robust linear Model Regression Results                    
==============================================================================
Dep. Variable:                   mass   No. Observations:                   59
Model:                            RLM   Df Residuals:                       57
Method:                          IRLS   Df Model:                            1
Norm:                     TrimmedMean                                         
Scale Est.:                       mad                                         
Cov Type:                          H1                                         
Date:                Wed, 02 Nov 2022                                         
Time:                        23:48:37                                         
No. Iterations:                     7                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -31.4968      2.180    -14.451      0.000     -35.769     -27.225
height         0.6273      0.012     51.102      0.000       0.603       0.651
==============================================================================

If the model instance has been used for another fit with different fit parameters, then the fit options might not be the correct ones anymore .

There are many different ‘M-estimators’ available; in this case the TrimmedMean estimator gives a very similar result to the regression with the point excluded. We can visualise this, and, well, the results are not really very different in this case. Note that abline_plot just takes an intercept and coefficient from a fitted model and renders the line that they encode.

fig, ax = plt.subplots()
ax.scatter(df["height"], df["mass"])
sm.graphics.abline_plot(model_results=results_robust, ax=ax, alpha=0.5, label="Robust")
sm.graphics.abline_plot(
    model_results=results, ax=ax, color="red", label="OLS", alpha=0.5, ls="--"
)
ax.legend()
ax.set_xlabel("Height")
ax.set_ylabel("Mass")
ax.set_ylim(0, None)
plt.show()
_images/econmt-regression_22_0.svg

Standard errors#

You’ll have seen that there’s a column for the standard error of the estimates in the regression table and a message saying that the covariance type of these is ‘nonrobust’. Let’s say that, instead, we want to use Eicker-Huber-White robust standard errors, aka “HC2” standard errors. We can specify to use these up front standard errors up front in the fit method:

(smf.ols("mass ~ height", data=df).fit(cov_type="HC2").summary().tables[1])
coef std err z P>|z| [0.025 0.975]
Intercept -13.8103 23.456 -0.589 0.556 -59.782 32.162
height 0.6386 0.088 7.263 0.000 0.466 0.811

Or, alternatively, we can go back to our existing results and recompute the results from those:

print(results.get_robustcov_results("HC2").summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   mass   R-squared:                       0.018
Model:                            OLS   Adj. R-squared:                  0.001
Method:                 Least Squares   F-statistic:                     52.75
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           1.16e-09
Time:                        23:48:37   Log-Likelihood:                -385.50
No. Observations:                  59   AIC:                             775.0
Df Residuals:                      57   BIC:                             779.2
Df Model:                           1                                         
Covariance Type:                  HC2                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -13.8103     23.456     -0.589      0.558     -60.779      33.159
height         0.6386      0.088      7.263      0.000       0.463       0.815
==============================================================================
Omnibus:                      128.880   Durbin-Watson:                   2.025
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             7330.447
Skew:                           7.340   Prob(JB):                         0.00
Kurtosis:                      55.596   Cond. No.                         895.
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC2)

There are several different types of standard errors available in statsmodels:

  • ‘HC0’, ‘HC1’, ‘HC2’, and ‘HC3’

  • ‘HAC’, for heteroskedasticity and autocorrelation consistent standard errors, for which you may want to also use some keyword arguments

  • ‘hac-groupsum’, for Driscoll and Kraay heteroscedasticity and autocorrelation robust standard errors in panel data, again for which you may have to specify extra keyword arguments

  • ‘hac-panel’, for heteroscedasticity and autocorrelation robust standard errors in panel data, again with keyword arguments; and

  • ‘cluster’ for clustered standard errors.

You can find information on all of these here. For more on standard errors in python, this is a good link.

For now, let’s look more closely at those last ones: clustered standard errors.

Clustered standard errors#

Often, we know something about the structure of likely errors, namely that they occur in groups. In the below example we use one-way clusters to capture this effect in the errors.

Note that in the below example, we grab a subset of the data for which a set of variables we’re interested in are defined, otherwise the below example would execute with an error because of missing cluster-group values.

xf = df.dropna(subset=["homeworld", "mass", "height", "species"])
results_clus = smf.ols("mass ~ height", data=xf).fit(
    cov_type="cluster", cov_kwds={"groups": xf["homeworld"]}
)
print(results_clus.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   mass   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                 -0.005
Method:                 Least Squares   F-statistic:                     39.44
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           2.63e-07
Time:                        23:48:37   Log-Likelihood:                -361.23
No. Observations:                  55   AIC:                             726.5
Df Residuals:                      53   BIC:                             730.5
Df Model:                           1                                         
Covariance Type:              cluster                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -8.7955     29.150     -0.302      0.763     -65.929      48.338
height         0.6159      0.098      6.280      0.000       0.424       0.808
==============================================================================
Omnibus:                      121.086   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5943.317
Skew:                           7.095   Prob(JB):                         0.00
Kurtosis:                      51.909   Cond. No.                         958.
==============================================================================

Notes:
[1] Standard Errors are robust to cluster correlation (cluster)

We can add two-way clustering of standard errors using the following:

xf = df.dropna(subset=["homeworld", "mass", "height", "species"])
two_way_clusters = np.array(xf[["homeworld", "species"]], dtype=str)
results_clus = smf.ols("mass ~ height", data=xf).fit(
    cov_type="cluster", cov_kwds={"groups": two_way_clusters}
)
print(results_clus.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   mass   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                 -0.005
Method:                 Least Squares   F-statistic:                     35.10
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           1.96e-06
Time:                        23:48:37   Log-Likelihood:                -361.23
No. Observations:                  55   AIC:                             726.5
Df Residuals:                      53   BIC:                             730.5
Df Model:                           1                                         
Covariance Type:              cluster                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -8.7955     30.636     -0.287      0.774     -68.841      51.250
height         0.6159      0.104      5.925      0.000       0.412       0.820
==============================================================================
Omnibus:                      121.086   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5943.317
Skew:                           7.095   Prob(JB):                         0.00
Kurtosis:                      51.909   Cond. No.                         958.
==============================================================================

Notes:
[1] Standard Errors are robust to cluster correlation (cluster)

As you would generally expect, the addition of clustering has increased the standard errors.

Fixed effects and categorical variables#

Fixed effects are a way of allowing the intercept of a regression model to vary freely across individuals or groups. It is, for example, used to control for any individual-specific attributes that do not vary across time in panel data.

Let’s use the ‘mtcars’ dataset to demonstrate this. We’ll read it in and set the datatypes of some of the columns at the same time.

mpg = pd.read_csv(
    "https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv",
    dtype={"model": str, "mpg": float, "hp": float, "disp": float, "cyl": "category"},
)
mpg.head()
model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110.0 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110.0 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93.0 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110.0 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175.0 3.15 3.440 17.02 0 0 3 2

Now we have our data in we want to regress mpg (miles per gallon) on hp (horsepower) with fixed effects for cyl (cylinders). Now we could just pop in a formula like this 'mpg ~ hp + cyl' because we took the trouble to declare that cyl was of datatype category when reading it in from the csv file. This means that statsmodels will treat it as a category and use it as a fixed effect by default.

But when I read that formula I get nervous that cyl might not have been processed correctly (ie it could have been read in as a float, which is what it looks like) and it might just be treated as a float (aka a continuous variable) in the regression. Which is not what we want at all. So, to be safe, and make our intentions explicit (even when the data is of type ‘category’), it’s best to use the syntax C(cyl) to ask for a fixed effect.

Here’s a regression which does that:

results_fe = smf.ols("mpg ~ hp + C(cyl)", data=mpg).fit()
print(results_fe.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.754
Model:                            OLS   Adj. R-squared:                  0.727
Method:                 Least Squares   F-statistic:                     28.59
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           1.14e-08
Time:                        23:48:37   Log-Likelihood:                -79.948
No. Observations:                  32   AIC:                             167.9
Df Residuals:                      28   BIC:                             173.8
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept      28.6501      1.588     18.044      0.000      25.398      31.903
C(cyl)[T.6]    -5.9677      1.639     -3.640      0.001      -9.326      -2.610
C(cyl)[T.8]    -8.5209      2.326     -3.663      0.001     -13.286      -3.756
hp             -0.0240      0.015     -1.560      0.130      -0.056       0.008
==============================================================================
Omnibus:                        0.251   Durbin-Watson:                   1.667
Prob(Omnibus):                  0.882   Jarque-Bera (JB):                0.417
Skew:                           0.163   Prob(JB):                        0.812
Kurtosis:                       2.545   Cond. No.                         766.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

We can see here that two of the three possible values of cyl:

mpg["cyl"].unique()
['6', '4', '8']
Categories (3, object): ['4', '6', '8']

have been added as fixed effects regressors. The way that +C(cyl) has been added makes it so that the coefficients given are relative to the coefficient for the intercept. We can turn the intercept off to get a coefficient per unique cyl value:

print(smf.ols("mpg ~ hp + C(cyl) -1", data=mpg).fit().summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
C(cyl)[4]     28.6501      1.588     18.044      0.000      25.398      31.903
C(cyl)[6]     22.6825      2.228     10.180      0.000      18.119      27.246
C(cyl)[8]     20.1293      3.331      6.042      0.000      13.305      26.953
hp            -0.0240      0.015     -1.560      0.130      -0.056       0.008
==============================================================================

When there is an intercept, the coefficients of fixed effect variables can be interpreted as being the average of \(y\) for that class compared to the excluded classes holding all other categories and variables fixed.

High dimensional fixed effects, aka absorbing regression#

Sometimes, you just have a LOT of fixed effects (and perhaps you don’t particularly care about them individually). A common example is having a large number of firms as part of a panel. Fortunately, there are ways to make regression with high dimensional fixed effects be both fast and concise. (In Stata, this is provided by the reghdfe package.) Here, we will use the linearmodels package, which is built on top of statsmodels.

Let’s say we have a regression of the form

\[ y_i = x_i\cdot \beta + z_i\cdot \gamma +\epsilon_i \]

where \(y_i\) are observations indexed by \(i\), \(x_i\) are vectors of exogenous variables we care about the coefficients (\(\beta\)), \(z_i\) are vectors of fixed effects we don’t care too much about the coefficients (\gamma) for, and the \(\epsilon_i\) are errors. Then we can use an absorbing regression to solve for the \(\beta\) while ignoring the \(\gamma\).

Here’s an example using simulated data on workers taken from the linearmodels docs. Let’s simulate some data first, with two fixed effects (state and firm) alongside the two exogenous variables we’re interested in.

from numpy.random import default_rng

rng = default_rng()  # Random number generator

# Create synthetic input data
nobs = 1_000_000  # No. observations
state_id = rng.integers(50, size=nobs)  # State identifier
firm_id = rng.integers(nobs // 5, size=nobs)  # Firm identifier (mean of 5 workers/firm)
x = rng.standard_normal((nobs, 2))  # Exogenous variables
sim = pd.DataFrame(
    {
        "state_id": pd.Categorical(state_id),
        "firm_id": pd.Categorical(firm_id),
        "exog_0": x[:, 0],
        "exog_1": x[:, 1],
    }
)

# Create synthetic relationship
beta = [1, 3]  # coefficients of interest
state_effects = rng.standard_normal(state_id.max() + 1)
state_effects = state_effects[state_id]  # Generate state fixed effects
firm_effects = rng.standard_normal(firm_id.max() + 1)
firm_effects = firm_effects[firm_id]  # Generate firm fixed effects
eps = rng.standard_normal(nobs)  # Generate errors
# Generate endogeneous outcome variable
sim["y"] = (
    sim["exog_0"] * beta[0]
    + sim["exog_1"] * beta[1]
    + firm_effects
    + state_effects
    + eps
)
sim.head()
state_id firm_id exog_0 exog_1 y
0 29 192184 -1.000461 -1.591348 -5.984837
1 24 162326 -0.397737 -0.601988 -3.115233
2 19 170547 -0.069251 1.617722 6.198774
3 32 103096 0.346476 0.474450 -0.166212
4 43 124949 -1.113171 -2.451435 -12.128263

Now we pass this to linearmodels and with the state_id and firm_id variables entered via the absorb keyword argument:

from linearmodels.iv.absorbing import AbsorbingLS

mod = AbsorbingLS(
    sim["y"], sim[["exog_0", "exog_1"]], absorb=sim[["state_id", "firm_id"]]
)
print(mod.fit())
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/absorbing.py:710: FutureWarning: In a future version of pandas all arguments of DataFrame.any and Series.any will be keyword-only.
  missing |= self._absorb_inter.cat.isnull().any(1).to_numpy()
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/absorbing.py:711: FutureWarning: In a future version of pandas all arguments of DataFrame.any and Series.any will be keyword-only.
  missing |= self._absorb_inter.cont.isnull().any(1).to_numpy()
                         Absorbing LS Estimation Summary                          
==================================================================================
Dep. Variable:                      y   R-squared:                          0.9385
Estimator:               Absorbing LS   Adj. R-squared:                     0.9233
No. Observations:             1000000   F-statistic:                     9.846e+06
Date:                Wed, Nov 02 2022   P-value (F-stat):                   0.0000
Time:                        23:48:43   Distribution:                      chi2(2)
Cov. Estimator:                robust   R-squared (No Effects):             0.9092
                                        Varaibles Absorbed:              1.987e+05
                             Parameter Estimates                              
==============================================================================
            Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
------------------------------------------------------------------------------
exog_0         1.0004     0.0010     991.35     0.0000      0.9984      1.0023
exog_1         3.0002     0.0010     2978.2     0.0000      2.9982      3.0021
==============================================================================

So, from our 1,000,000 observations, we have roughly 200,000 fixed effects that have been scooped up and packed away, leaving us with just the coefficients, \(\beta\), on the exogenous variables of interest.

Transformations of regressors#

This chapter is showcasing linear regression. What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can’t be some kind of (potentially non-linear) transform of the original features \(x_i\).

Logs and arcsinh#

You have two options for adding in logs: do them before, or do them in the formula. Doing them before just makes use of standard dataframe operations to declare a new column:

mpg["lnhp"] = np.log(mpg["hp"])
print(smf.ols("mpg ~ lnhp", data=mpg).fit().summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     72.6405      6.004     12.098      0.000      60.378      84.903
lnhp         -10.7642      1.224     -8.792      0.000     -13.265      -8.264
==============================================================================

Alternatively, you can specify the log directly in the formula:

results_ln = smf.ols("mpg ~ np.log(hp)", data=mpg).fit()
print(results_ln.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     72.6405      6.004     12.098      0.000      60.378      84.903
np.log(hp)   -10.7642      1.224     -8.792      0.000     -13.265      -8.264
==============================================================================

Clearly, the first method will work for arcsinh(x) and log(x+1), but you can also pass both of these into the formula directly too. (For more on the pros and cons of arcsinh, see Bellemare and Wichman [2020].) Here it is with arcsinh:

print(smf.ols("mpg ~ np.arcsinh(hp)", data=mpg).fit().summary().tables[1])
==================================================================================
                     coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------
Intercept         80.1041      6.850     11.694      0.000      66.115      94.094
np.arcsinh(hp)   -10.7646      1.224     -8.792      0.000     -13.265      -8.264
==================================================================================

Interaction terms and powers#

This chapter is showcasing linear regression. What that means is that the model is linear in the regressors: but it doesn’t mean that those regressors can’t be some kind of non-linear transform of the original features \(x_i\). Two of the most common transformations that you might want to use are interaction terms and polynomial terms. An example of an interaction term would be

\[ y = \beta_0 + \beta_1 x_1 \cdot x_2 \]

while an example of a polynomial term would be

\[ y = \beta_0 + \beta_1 x_1^2 \]

i.e. the last term enters only after it is multiplied by itself.

One note of warning: the interpretation of the effect of a variable is no longer as simple as was set out at the start of this chapter. To work out what the new interpretation is, the procedure is the same though: just take the derivative. In the case of the interaction model above, the effect of a unit change in \(x_1\) on \(y\) is now going to be a function of \(x_2\). In the case of the polynomial model above, the effect of a unit change in \(x_1\) on \(y\) will be \(2\beta_1 \cdot x_1\). For more on interaction terms, see Balli and Sørensen [2013].

Alright, with all of that preamble out of the way, let’s see how we actual do some of this! Let’s try including a linear and squared term in the regression of mpg on hp making use of the numpy power function:

res_poly = smf.ols("mpg ~ hp + np.power(hp, 2)", data=mpg).fit()
print(res_poly.summary().tables[1])
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          40.4091      2.741     14.744      0.000      34.804      46.015
hp                 -0.2133      0.035     -6.115      0.000      -0.285      -0.142
np.power(hp, 2)     0.0004   9.84e-05      4.275      0.000       0.000       0.001
===================================================================================

Now let’s include the original term in hp, a term in disp, and the interaction between them, which is represented by hp:disp in the table.

res_inter = smf.ols("mpg ~ hp * disp", data=mpg).fit()
print(res_inter.summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     39.6743      2.914     13.614      0.000      33.705      45.644
hp            -0.0979      0.025     -3.956      0.000      -0.149      -0.047
disp          -0.0734      0.014     -5.100      0.000      -0.103      -0.044
hp:disp        0.0003   8.69e-05      3.336      0.002       0.000       0.000
==============================================================================

In the unusual case that you want only the interaction term, you write it as it appears in the table above:

print(smf.ols("mpg ~ hp : disp", data=mpg).fit().summary().tables[1])
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     25.8099      1.005     25.677      0.000      23.757      27.863
hp:disp       -0.0001   1.91e-05     -7.409      0.000      -0.000      -0.000
==============================================================================

The formula API explained#

As you will have seen ~ separates the left- and right-hand sides of the regression. + computes a set union, which will also be familiar from the examples above (ie it inludes two terms as long as they are distinct). - computes a set difference; it adds the set of terms to the left of it while removing any that appear on the right of it. As we’ve seen, a*b is a short-hand for a + b + a:b, with the last term representing the interaction. / is short hand for a + a:b, which is useful if, for example b is nested within a, so it doesn’t make sense to control for b on its own. Actually, the : character can interact multiple terms so that (a + b):(d + c) is the same as a:c + a:d + b:c + b:d. C(a) tells statsmodels to treat a as a categorical variable that will be included as a fixed effect. Finally, as we saw above with powers, you can also pass in vectorised functions, such as np.log and np.power, directly into the formulae.

One gotcha with the formula API is ensuring that you have sensible variable names in your dataframe, i.e. ones that do not include whitespace or, to take a really pathological example, have the name ‘a + b’ for one of the columns that you want to regress on. You can dodge this kind of problem by passing in the variable name as, for example, Q("a + b") to be clear that the column name is anything within the Q("...").

Multiple regression models#

As is so often the case, you’re likely to want to run more than one model at once with different specifications. Although there is a base version of this in statsmodels, called summary_col, which you can find an example of here, instead we’ll be using the stargazer package to assemble the regressions together in a table.

In the above examples, we’ve collected a few different regression results. Let’s put them together:

from stargazer.stargazer import Stargazer


stargazer_tab = Stargazer([results_ln, res_poly, res_inter])
stargazer_tab
Dependent variable:mpg
(1)(2)(3)
Intercept72.640***40.409***39.674***
(6.004)(2.741)(2.914)
disp-0.073***
(0.014)
hp-0.213***-0.098***
(0.035)(0.025)
hp:disp0.000***
(0.000)
np.log(hp)-10.764***
(1.224)
np.power(hp, 2)0.000***
(0.000)
Observations323232
R20.7200.7560.820
Adjusted R20.7110.7390.801
Residual Std. Error3.239 (df=30)3.077 (df=29)2.692 (df=28)
F Statistic77.301*** (df=1; 30)44.953*** (df=2; 29)42.475*** (df=3; 28)
Note: *p<0.1; **p<0.05; ***p<0.01

There are lots of customisation options, including ones that add a title, rename variables, add notes, and so on. What is most useful is that as well as the HTML friendly output that you can see above, the package also exports to latex:

print(stargazer_tab.render_latex())
\begin{table}[!htbp] \centering
\begin{tabular}{@{\extracolsep{5pt}}lccc}
\\[-1.8ex]\hline
\hline \\[-1.8ex]
& \multicolumn{3}{c}{\textit{Dependent variable:}} \
\cr \cline{3-4}
\\[-1.8ex] & (1) & (2) & (3) \\
\hline \\[-1.8ex]
 Intercept & 72.640$^{***}$ & 40.409$^{***}$ & 39.674$^{***}$ \\
  & (6.004) & (2.741) & (2.914) \\
 disp & & & -0.073$^{***}$ \\
  & & & (0.014) \\
 hp & & -0.213$^{***}$ & -0.098$^{***}$ \\
  & & (0.035) & (0.025) \\
 hp:disp & & & 0.000$^{***}$ \\
  & & & (0.000) \\
 np.log(hp) & -10.764$^{***}$ & & \\
  & (1.224) & & \\
 np.power(hp, 2) & & 0.000$^{***}$ & \\
  & & (0.000) & \\
\hline \\[-1.8ex]
 Observations & 32 & 32 & 32 \\
 $R^2$ & 0.720 & 0.756 & 0.820 \\
 Adjusted $R^2$ & 0.711 & 0.739 & 0.801 \\
 Residual Std. Error & 3.239(df = 30) & 3.077(df = 29) & 2.692(df = 28)  \\
 F Statistic & 77.301$^{***}$ (df = 1.0; 30.0) & 44.953$^{***}$ (df = 2.0; 29.0) & 42.475$^{***}$ (df = 3.0; 28.0) \\
\hline
\hline \\[-1.8ex]
\textit{Note:} & \multicolumn{3}{r}{$^{*}$p$<$0.1; $^{**}$p$<$0.05; $^{***}$p$<$0.01} \\
\end{tabular}
\end{table}

And of course this can be written to a file using open('regression.tex', 'w').write(stargazer.render_latex()) where you can get your main latex compilation to scoop it up and use it.

Specifying regressions without formulae, using the array API#

As noted, there are two ways to run regressions in statsmodels; passing the data directly as objects, and using formulae. We’ve seen the formula API, now let’s see how to specify regressions using arrays with the format sm.OLS(y, X).

We will first need to take the data out of the pandas dataframe and put it into a couple of arrays. When we’re not using the formula API, the default is to treat the array X as the design matrix for the regression-so, if it doesn’t have a column of constants in, there will be no intercept in the regression. Therefore, we need to add a constant vector to the matrix X if we do want an intercept. Use sm.add_constant(X) for this.

X = np.array(xf["height"])
y = np.array(xf["mass"])
X = sm.add_constant(X)
results = sm.OLS(y, X).fit()
print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.014
Model:                            OLS   Adj. R-squared:                 -0.005
Method:                 Least Squares   F-statistic:                    0.7446
Date:                Wed, 02 Nov 2022   Prob (F-statistic):              0.392
Time:                        23:48:44   Log-Likelihood:                -361.23
No. Observations:                  55   AIC:                             726.5
Df Residuals:                      53   BIC:                             730.5
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -8.7955    127.191     -0.069      0.945    -263.909     246.318
x1             0.6159      0.714      0.863      0.392      -0.816       2.048
==============================================================================
Omnibus:                      121.086   Durbin-Watson:                   2.029
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             5943.317
Skew:                           7.095   Prob(JB):                         0.00
Kurtosis:                      51.909   Cond. No.                         958.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

This approach seems a lot less convenient, not to mention less clear, so you may be wondering when it is useful. It’s useful when you want to do many regressions in a systematic way or when you don’t know what the columns of a dataset will be called ahead of time. It can actually be a little bit simpler to specify for more complex regressions too.

Fixed effects in the array API#

If you’re using the formula API, it’s easy to turn a regressor x into a fixed effect by putting C(x) into the model formula, as you’ll see in the next section.

For the array API, things are not that simple and you need to use dummy variables. Let’s say we have some data like this:

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng=probabilistic random number generator

no_obs = 200
X = pd.DataFrame(prng.normal(size=no_obs))
X[1] = prng.choice(["a", "b"], size=no_obs)
# Get this a numpy array
X = X.values
# Create the y data, adding in a bit of noise
y = X[:, 0] * 2 + 0.5 + prng.normal(scale=0.1, size=no_obs)
y = [el_y + 1.5 if el_x == "a" else el_y + 3.4 for el_y, el_x in zip(y, X[:, 1])]
X[:5, :]
array([[0.927388713624307, 'b'],
       [-0.05975401040013389, 'a'],
       [1.0437358922863453, 'a'],
       [1.5056188793532617, 'b'],
       [0.10762743224577262, 'a']], dtype=object)

The first feature (column) is of numbers and it’s clear how we include it. The second, however, is a grouping that we’d like to include as a fixed effect. But if we just throw this matrix into sm.OLS(y, X), we’re going to get trouble because statsmodels isn’t sure what to do with a vector of strings. So, instead, we need to create some dummy variables out of our second column of data

Astonishingly, there are several popular ways to create dummy variables in Python: scikit-learn’s OneHotEncoder and pandasget_dummies being my favourites. Let’s use the latter here.

pd.get_dummies(X[:, 1])
a b
0 0 1
1 1 0
2 1 0
... ... ...
197 1 0
198 1 0
199 0 1

200 rows × 2 columns

We just need to pop this into our matrix \(X\):

X = np.column_stack([X[:, 0], pd.get_dummies(X[:, 1])])
X = np.array(X, dtype=float)
X[:5, :]
array([[ 0.92738871,  0.        ,  1.        ],
       [-0.05975401,  1.        ,  0.        ],
       [ 1.04373589,  1.        ,  0.        ],
       [ 1.50561888,  0.        ,  1.        ],
       [ 0.10762743,  1.        ,  0.        ]])

Okay, so now we’re ready to do our regression:

print(sm.OLS(y, X).fit().summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.998
Model:                            OLS   Adj. R-squared:                  0.998
Method:                 Least Squares   F-statistic:                 4.302e+04
Date:                Wed, 02 Nov 2022   Prob (F-statistic):          6.90e-261
Time:                        23:48:44   Log-Likelihood:                 165.45
No. Observations:                 200   AIC:                            -324.9
Df Residuals:                     197   BIC:                            -315.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             1.9893      0.008    263.045      0.000       1.974       2.004
x2             1.9924      0.011    176.203      0.000       1.970       2.015
x3             3.8977      0.010    384.415      0.000       3.878       3.918
==============================================================================
Omnibus:                        2.096   Durbin-Watson:                   2.116
Prob(Omnibus):                  0.351   Jarque-Bera (JB):                2.017
Skew:                           0.245   Prob(JB):                        0.365
Kurtosis:                       2.955   Cond. No.                         1.50
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Perhaps you can see why I generally prefer the formula API…

Instrumental variables#

Rather than use statsmodels for IV, we’ll use the linearmodels package, which has very clean documentation (indeed, this sub-section is indebted to that documentation).

Recall that a good instrumental variable \(z\) has zero covariance with the error from the regression (which is untestable) and non-zero covariance with the variable of interest (which is).

Recall that in IV regression, we have a model of the form

\[\begin{split} \begin{split}y_i & = x_{1i}\hat{\beta_1} + x_{2i}\hat{\beta_2} + \epsilon_i \\ x_{2i} & = z_{1i}\hat{\delta} + z_{2i}\hat{\gamma} + \nu_i\end{split} \end{split}\]

where \(x_{1i}\) is a set of \(k_1\) exogenous regressors and \(x_{2i}\) is a set of \(k_2\) endogenous regressors such that \(\text{Cov}(x_{2i}, \epsilon_i)\neq 0\). This is a problem for the usual OLS assumptions (the right-hand side should be exogenous).

To get around this, in 2-stage least squares IV, we first regress \(x_{2i}\) on instruments that explain \(x_{2i}\) but not \(y_i\), and then regress \(y_i\) only on the predicted/estimated left-hand side from the first regression, ie on \(\hat{x_{2i}}\). There are other estimators than IV2SLS, but I think that one has the most intuitive explanation of what’s going.

As well as a 2-stage least squares estimator called IV2SLS, linearmodels has a Limited Information Maximum Likelihood (LIML) estimator IVLIML, a Generalized Method of Moments (GMM) estimator IVGMM, and a Generalized Method of Moments using the Continuously Updating Estimator (CUE) IVGMMCUE.

Just as with OLS via statsmodels, there’s an option to use an array API for the linearmodels IV methods.

It’s always easiest to see an example, so let’s estimate what might cause (realised) cigarette demand for the 48 continental US states in 1995 with IV2SLS. First we need to import the estimator, IV2SLS, and the data:

from linearmodels.iv import IV2SLS

df = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/AER/CigarettesSW.csv",
    dtype={"state": "category", "year": "category"},
).assign(
    rprice=lambda x: x["price"] / x["cpi"],
    rincome=lambda x: x["income"] / x["population"] / x["cpi"],
)
df.head()
Unnamed: 0 state year cpi population packs income tax price taxs rprice rincome
0 1 AL 1985 1.076 3973000.0 116.486282 46014968 32.500004 102.181671 33.348335 94.964381 10.763866
1 2 AR 1985 1.076 2327000.0 128.534592 26210736 37.000000 101.474998 37.000000 94.307622 10.468165
2 3 AZ 1985 1.076 3184000.0 104.522614 43956936 31.000000 108.578751 36.170418 100.909622 12.830456
3 4 CA 1985 1.076 26444000.0 100.363037 447102816 26.000000 107.837341 32.104000 100.220580 15.713321
4 5 CO 1985 1.076 3209000.0 112.963539 49466672 31.000000 94.266663 31.000000 87.608425 14.326190

Now we’ll specify the model. It’s going to be in the form dep ~ exog + [endog ~ instruments], where endog will be regressed on instruments and dep will be regressed on both exog and the predicted values of endog.

In this case, the model will be

\[ \text{Price}_i = \hat{\pi_0} + \hat{\pi_1} \text{SalesTax}_i + v_i \]

in the first stage regression and

\[ \text{Packs}_i = \hat{\beta_0} + \hat{\beta_2}\widehat{\text{Price}_i} + \hat{\beta_1} \text{RealIncome}_i + u_i \]

in the second stage.

results_iv2sls = IV2SLS.from_formula(
    "np.log(packs) ~ 1 + np.log(rincome) + C(year) + C(state) + [np.log(rprice) ~ taxs]",
    df,
).fit(cov_type="clustered", clusters=df["year"])
print(results_iv2sls.summary)
                          IV-2SLS Estimation Summary                          
==============================================================================
Dep. Variable:          np.log(packs)   R-squared:                      0.9659
Estimator:                    IV-2SLS   Adj. R-squared:                 0.9279
No. Observations:                  96   F-statistic:                  1.23e+17
Date:                Wed, Nov 02 2022   P-value (F-stat)                0.0000
Time:                        23:48:44   Distribution:                 chi2(50)
Cov. Estimator:             clustered                                         
                                                                              
                                Parameter Estimates                                
===================================================================================
                 Parameter  Std. Err.     T-stat    P-value    Lower CI    Upper CI
-----------------------------------------------------------------------------------
Intercept           9.4924     0.0263     360.24     0.0000      9.4407      9.5440
C(state)[T.AR]      0.1770     0.0531     3.3338     0.0009      0.0729      0.2810
C(state)[T.AZ]     -0.0899     0.0132    -6.8132     0.0000     -0.1158     -0.0640
C(state)[T.CA]     -0.2781     0.0214    -12.996     0.0000     -0.3200     -0.2361
C(state)[T.CO]     -0.2479     0.0090    -27.625     0.0000     -0.2655     -0.2303
C(state)[T.CT]     -0.0171     0.0196    -0.8720     0.3832     -0.0556      0.0213
C(state)[T.DE]      0.1110     0.0291     3.8105     0.0001      0.0539      0.1682
C(state)[T.FL]      0.0762     0.0142     5.3596     0.0000      0.0483      0.1041
C(state)[T.GA]     -0.0695     0.0251    -2.7706     0.0056     -0.1186     -0.0203
C(state)[T.IA]      0.0120     0.0739     0.1629     0.8706     -0.1328      0.1569
C(state)[T.ID]     -0.1272     0.0077    -16.597     0.0000     -0.1423     -0.1122
C(state)[T.IL]     -0.0339     0.0081    -4.1912     0.0000     -0.0497     -0.0180
C(state)[T.IN]      0.1198     0.0611     1.9609     0.0499   5.573e-05      0.2395
C(state)[T.KS]     -0.0910     0.0305    -2.9884     0.0028     -0.1507     -0.0313
C(state)[T.KY]      0.3525     0.0631     5.5906     0.0000      0.2289      0.4761
C(state)[T.LA]      0.1315     0.0104     12.664     0.0000      0.1112      0.1519
C(state)[T.MA]     -0.0403     0.0069    -5.8826     0.0000     -0.0538     -0.0269
C(state)[T.MD]     -0.2322     0.0239    -9.7376     0.0000     -0.2790     -0.1855
C(state)[T.ME]      0.2008     0.0574     3.5011     0.0005      0.0884      0.3133
C(state)[T.MI]      0.1268     0.0745     1.7009     0.0890     -0.0193      0.2728
C(state)[T.MN]      0.0568     0.0490     1.1595     0.2463     -0.0392      0.1529
C(state)[T.MO]      0.0640     0.0476     1.3454     0.1785     -0.0292      0.1572
C(state)[T.MS]      0.1501     0.0272     5.5267     0.0000      0.0969      0.2034
C(state)[T.MT]     -0.1522     0.0054    -28.250     0.0000     -0.1627     -0.1416
C(state)[T.NC]      0.0396     0.0191     2.0655     0.0389      0.0020      0.0771
C(state)[T.ND]     -0.0311     0.0399    -0.7787     0.4361     -0.1092      0.0471
C(state)[T.NE]     -0.0741     0.0375    -1.9765     0.0481     -0.1476     -0.0006
C(state)[T.NH]      0.3504     0.0315     11.114     0.0000      0.2886      0.4122
C(state)[T.NJ]     -0.0873  6.107e-05    -1429.3     0.0000     -0.0874     -0.0872
C(state)[T.NM]     -0.2858     0.0040    -71.049     0.0000     -0.2937     -0.2779
C(state)[T.NV]      0.1789     0.0259     6.9075     0.0000      0.1281      0.2296
C(state)[T.NY]     -0.0719     0.0032    -22.256     0.0000     -0.0782     -0.0655
C(state)[T.OH]      0.0325     0.0402     0.8088     0.4186     -0.0463      0.1114
C(state)[T.OK]      0.0946     0.0538     1.7572     0.0789     -0.0109      0.2000
C(state)[T.OR]     -0.0153     0.0673    -0.2269     0.8205     -0.1471      0.1166
C(state)[T.PA]     -0.0031     0.0006    -4.8401     0.0000     -0.0044     -0.0019
C(state)[T.RI]      0.1394     0.0921     1.5136     0.1301     -0.0411      0.3200
C(state)[T.SC]     -0.0212     0.0334    -0.6345     0.5257     -0.0866      0.0442
C(state)[T.SD]     -0.0675     0.0711    -0.9488     0.3427     -0.2069      0.0719
C(state)[T.TN]      0.1473     0.0470     3.1340     0.0017      0.0552      0.2394
C(state)[T.TX]     -0.0579     0.0136    -4.2560     0.0000     -0.0845     -0.0312
C(state)[T.UT]     -0.4899     0.0276    -17.776     0.0000     -0.5440     -0.4359
C(state)[T.VA]     -0.0559     0.0471    -1.1875     0.2350     -0.1482      0.0364
C(state)[T.VT]      0.2209     0.0467     4.7267     0.0000      0.1293      0.3125
C(state)[T.WA]      0.0064     0.0011     6.0151     0.0000      0.0043      0.0085
C(state)[T.WI]      0.0741     0.0590     1.2569     0.2088     -0.0415      0.1897
C(state)[T.WV]      0.1576     0.0582     2.7097     0.0067      0.0436      0.2716
C(state)[T.WY]     -0.0169     0.0590    -0.2858     0.7750     -0.1325      0.0988
C(year)[T.1995]    -0.0328  3.591e-09 -9.137e+06     0.0000     -0.0328     -0.0328
np.log(rincome)     0.4434                                                         
np.log(rprice)     -1.2793  9.692e-09  -1.32e+08     0.0000     -1.2793     -1.2793
===================================================================================

Endogenous: np.log(rprice)
Instruments: taxs
Clustered Covariance (One-Way)
Debiased: False
Num Clusters: 2
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/results.py:180: RuntimeWarning: invalid value encountered in sqrt
  std_errors = sqrt(diag(self.cov))

We sort of skipped a step here and did everything all in one go. If we did want to know how our first stage regression went, we can just pass a formula to IV2SLS without the part in square brackets, [...], and it will run regular OLS.

But, in this case, there’s an easier way: we can print out a set of handy 1st stage statistics from running the full model.

print(results_iv2sls.first_stage)
      First Stage Estimation Results     
=========================================
                           np.log(rprice)
-----------------------------------------
R-squared                          0.9760
Partial R-squared                  0.7070
Shea's R-squared                   0.7070
Partial F-statistic             1.568e+18
P-value (Partial F-stat)           0.0000
Partial F-stat Distn              chi2(1)
========================== ==============
Intercept                          4.4375
                                 (410.34)
C(state)[T.AR]                    -0.0190
                                (-3.5717)
C(state)[T.AZ]                     0.0552
                                 (20.534)
C(state)[T.CA]                     0.0969
                                 (5.3807)
C(state)[T.CO]                    -0.0044
                                (-0.1117)
C(state)[T.CT]                     0.1269
                                 (9.0641)
C(state)[T.DE]                     0.0254
                                 (7.9851)
C(state)[T.FL]                     0.0598
                                 (4.0067)
C(state)[T.GA]                    -0.0049
                                (-0.3455)
C(state)[T.IA]                     0.0135
                                 (0.5305)
C(state)[T.ID]                     0.0297
                                 (79.832)
C(state)[T.IL]                     0.0476
                                 (12.008)
C(state)[T.IN]                    -0.0460
                                (-76.127)
C(state)[T.KS]                    -0.0019
                                (-0.0855)
C(state)[T.KY]                    -0.0783
                                (-3.6663)
C(state)[T.LA]                     0.0298
                                 (2.6211)
C(state)[T.MA]                     0.0816
                                 (5.3671)
C(state)[T.MD]                    -0.0199
                                (-0.5429)
C(state)[T.ME]                     0.0370
                                 (1.8744)
C(state)[T.MI]                     0.0363
                                 (1.9899)
C(state)[T.MN]                     0.0932
                                 (8.6047)
C(state)[T.MO]                    -0.0056
                                (-12.033)
C(state)[T.MS]                     0.0147
                                 (2.1430)
C(state)[T.MT]                    -0.0183
                                (-9.6530)
C(state)[T.NC]                    -0.0672
                                (-1.8041)
C(state)[T.ND]                     0.0130
                                 (1.8765)
C(state)[T.NE]                     0.0105
                                 (2.4846)
C(state)[T.NH]                    -0.0175
                                (-0.6334)
C(state)[T.NJ]                     0.0694
                                 (5.2527)
C(state)[T.NM]                     0.0301
                                 (2.6464)
C(state)[T.NV]                     0.1075
                                 (16.434)
C(state)[T.NY]                     0.0847
                                 (5.1334)
C(state)[T.OH]                    -0.0197
                                (-22.592)
C(state)[T.OK]                    -0.0035
                                (-0.7109)
C(state)[T.OR]                     0.0146
                                 (0.2944)
C(state)[T.PA]                     0.0152
                                 (0.9986)
C(state)[T.RI]                     0.0249
                                 (0.5964)
C(state)[T.SC]                    -0.0530
                                (-2.1043)
C(state)[T.SD]                    -0.0190
                                (-1.4001)
C(state)[T.TN]                     0.0051
                                 (0.5517)
C(state)[T.TX]                     0.0378
                                 (4.0760)
C(state)[T.UT]                     0.0577
                                 (5.8788)
C(state)[T.VA]                     0.0265
                                 (0.5400)
C(state)[T.VT]                     0.0228
                                 (1.0969)
C(state)[T.WA]                     0.1554
                                 (13.780)
C(state)[T.WI]                     0.0723
                                 (5.1954)
C(state)[T.WV]                     0.0205
                                 (1.7212)
C(state)[T.WY]                    -0.0005
                                (-0.0220)
C(year)[T.1995]                    0.0822
                              (1.799e+07)
np.log(rincome)                   -0.0296
                             (-1.675e+06)
taxs                               0.0051
                              (1.252e+09)
-----------------------------------------

T-stats reported in parentheses
T-stats use same covariance type as original model

There are more tests and checks available. For example, Wooldridge’s regression test of exogeneity uses regression residuals from the endogenous variables regressed on the exogenous variables and the instrument to test for endogenity and is available to run on fitted model results. Let’s check that:

results_iv2sls.wooldridge_regression
Wooldridge's regression test of exogeneity
H0: Endogenous variables are exogenous
Statistic: -24963534146713064.0000
P-value: 1.0000
Distributed: chi2(1)
WaldTestStatistic, id: 0x7fa2cff16160

We can compare the IV results against (naive) OLS. First, run the OLS equivalent:

res_cig_ols = IV2SLS.from_formula(
    "np.log(packs) ~ 1 + np.log(rincome) + C(year) + C(state) + np.log(rprice)", df
).fit(cov_type="clustered", clusters=df["year"])

Now select these two models to compare:

from collections import OrderedDict
from linearmodels.iv.results import compare

res = OrderedDict()
res["OLS"] = res_cig_ols
res["2SLS"] = results_iv2sls

print(compare(res))
                    Model Comparison                    
========================================================
                                   OLS              2SLS
--------------------------------------------------------
Dep. Variable            np.log(packs)     np.log(packs)
Estimator                          OLS           IV-2SLS
No. Observations                    96                96
Cov. Est.                    clustered         clustered
R-squared                       0.9675            0.9659
Adj. R-squared                  0.9313            0.9279
F-statistic                 -7.751e+16          1.23e+17
P-value (F-stat)                1.0000            0.0000
==================     ===============   ===============
Intercept                       8.3597            9.4924
                              (484.80)          (360.24)
C(state)[T.AR]                  0.1686            0.1770
                              (3.7460)          (3.3338)
C(state)[T.AZ]                 -0.1280           -0.0899
                             (-49.908)         (-6.8132)
C(state)[T.CA]                 -0.3320           -0.2781
                             (-9.0787)         (-12.996)
C(state)[T.CO]                 -0.2591           -0.2479
                             (-345.21)         (-27.625)
C(state)[T.CT]                 -0.1042           -0.0171
                             (-3.8346)         (-0.8720)
C(state)[T.DE]                  0.0900            0.1110
                              (3.3042)          (3.8105)
C(state)[T.FL]                  0.0325            0.0762
                              (1.9171)          (5.3596)
C(state)[T.GA]                 -0.0691           -0.0695
                             (-2.4532)         (-2.7706)
C(state)[T.IA]                 -0.0143            0.0120
                             (-0.2380)          (0.1629)
C(state)[T.ID]                 -0.1422           -0.1272
                             (-8.4099)         (-16.597)
C(state)[T.IL]                 -0.0766           -0.0339
                             (-10.214)         (-4.1912)
C(state)[T.IN]                  0.1231            0.1198
                              (2.0976)          (1.9609)
C(state)[T.KS]                 -0.1073           -0.0910
                             (-4.8883)         (-2.9884)
C(state)[T.KY]                  0.3803            0.3525
                              (6.6476)          (5.5906)
C(state)[T.LA]                  0.1174            0.1315
                              (12.648)          (12.664)
C(state)[T.MA]                 -0.1055           -0.0403
                             (-10.125)         (-5.8826)
C(state)[T.MD]                 -0.2572           -0.2322
                             (-62.161)         (-9.7376)
C(state)[T.ME]                  0.1682            0.2008
                              (3.7655)          (3.5011)
C(state)[T.MI]                  0.0652            0.1268
                              (1.4699)          (1.7009)
C(state)[T.MN]                 -0.0051            0.0568
                             (-0.1580)          (1.1595)
C(state)[T.MO]                  0.0600            0.0640
                              (1.2722)          (1.3454)
C(state)[T.MS]                  0.1472            0.1501
                              (6.1510)          (5.5267)
C(state)[T.MT]                 -0.1469           -0.1522
                             (-26.043)         (-28.250)
C(state)[T.NC]                  0.0609            0.0396
                              (7.1838)          (2.0655)
C(state)[T.ND]                 -0.0595           -0.0311
                             (-1.9467)         (-0.7787)
C(state)[T.NE]                 -0.1002           -0.0741
                             (-3.4255)         (-1.9765)
C(state)[T.NH]                  0.3374            0.3504
                              (14.331)          (11.114)
C(state)[T.NJ]                 -0.1458           -0.0873
                             (-11.223)         (-1429.3)
C(state)[T.NM]                 -0.2981           -0.2858
                             (-28.465)         (-71.049)
C(state)[T.NV]                  0.1219            0.1789
                              (3.2128)          (6.9075)
C(state)[T.NY]                 -0.1369           -0.0719
                             (-5.8164)         (-22.256)
C(state)[T.OH]                  0.0196            0.0325
                              (0.5379)          (0.8088)
C(state)[T.OK]                  0.0841            0.0946
                              (1.6659)          (1.7572)
C(state)[T.OR]                 -0.0380           -0.0153
                             (-0.7768)         (-0.2269)
C(state)[T.PA]                 -0.0333           -0.0031
                             (-16.905)         (-4.8401)
C(state)[T.RI]                  0.0900            0.1394
                              (1.3979)          (1.5136)
C(state)[T.SC]                 -0.0037           -0.0212
                             (-0.1369)         (-0.6345)
C(state)[T.SD]                 -0.0695           -0.0675
                             (-1.1052)         (-0.9488)
C(state)[T.TN]                  0.1373            0.1473
                              (3.2924)          (3.1340)
C(state)[T.TX]                 -0.0964           -0.0579
                             (-3.8212)         (-4.2560)
C(state)[T.UT]                 -0.5127           -0.4899
                             (-16.600)         (-17.776)
C(state)[T.VA]                 -0.0628           -0.0559
                             (-1.7642)         (-1.1875)
C(state)[T.VT]                  0.2053            0.2209
                              (5.3662)          (4.7267)
C(state)[T.WA]                 -0.0777            0.0064
                             (-6.5481)          (6.0151)
C(state)[T.WI]                  0.0260            0.0741
                              (0.5209)          (1.2569)
C(state)[T.WV]                  0.1490            0.1576
                              (2.4854)          (2.7097)
C(state)[T.WY]                 -0.0150           -0.0169
                             (-0.2771)         (-0.2858)
C(year)[T.1995]                -0.0885           -0.0328
                          (-2.569e+07)      (-9.137e+06)
np.log(rincome)                 0.4974            0.4434
                                                        
np.log(rprice)                 -1.0560           -1.2793
                          (-9.916e+07)       (-1.32e+08)
==================== ================= =================
Instruments                                         taxs
--------------------------------------------------------

T-stats reported in parentheses
/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/linearmodels/iv/results.py:180: RuntimeWarning: invalid value encountered in sqrt
  std_errors = sqrt(diag(self.cov))

Once we take into account the fact that the real price is endogeneous to (realised) demand, we find that its coefficient is more negative; i.e. an increase in the real price of cigarettes creates a bigger fall in number of packs bought.

Logit, probit, and generalised linear models#

Logit#

A logistical regression, aka a logit, is a statistical method for a best-fit line between a regressors \(X\) and an outcome varibale \(y\) that takes on values in \((0, 1)\).

The function that we’re assuming links the regressors and the outcome has a few different names but the most common is the sigmoid function or the logistic function. The data generating process is assumed to be

\[ {\displaystyle \mathbb{P}(Y=1\mid X) = \frac{1}{1 + e^{-X'\beta}}} \]

we can also write this as \(\ln\left(\frac{p}{p-1}\right) = \beta_0 + \sum_i \beta_i x_i\) to get a ‘log-odds’ relationship. The coefficients from a logit model do not have the same interpration as in an OLS estimation, and you can see this from the fact that \(\partial y/\partial x_i \neq \beta_i\) for logit. Of course, you can work out what the partial derivative is for yourself but most packages offer a convenient way to quickly recover the marginal effects.

Logit models are available in scikit-learn and statsmodels but bear in mind that the scikit-learn logit model is, ermm, extremely courageous in that regularisation is applied by default. If you don’t know what that means, don’t worry, but it’s probably best to stick with statsmodels as we will do in this example.

We will predict a target GRADE, representing whether a grade improved or not, based on some regressors including participation in a programme.

# Load the data from Spector and Mazzeo (1980)
df = sm.datasets.spector.load_pandas().data
# Look at info on data
print(sm.datasets.spector.NOTE)
::

    Number of Observations - 32

    Number of Variables - 4

    Variable name definitions::

        Grade - binary variable indicating whether or not a student's grade
                improved.  1 indicates an improvement.
        TUCE  - Test score on economics test
        PSI   - participation in program
        GPA   - Student's grade point average
res_logit = smf.logit("GRADE ~ GPA + TUCE + PSI", data=df).fit()
print(res_logit.summary())
Optimization terminated successfully.
         Current function value: 0.402801
         Iterations 7
                           Logit Regression Results                           
==============================================================================
Dep. Variable:                  GRADE   No. Observations:                   32
Model:                          Logit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Wed, 02 Nov 2022   Pseudo R-squ.:                  0.3740
Time:                        23:48:44   Log-Likelihood:                -12.890
converged:                       True   LL-Null:                       -20.592
Covariance Type:            nonrobust   LLR p-value:                  0.001502
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -13.0213      4.931     -2.641      0.008     -22.687      -3.356
GPA            2.8261      1.263      2.238      0.025       0.351       5.301
TUCE           0.0952      0.142      0.672      0.501      -0.182       0.373
PSI            2.3787      1.065      2.234      0.025       0.292       4.465
==============================================================================

So, did participation (PSI) help increase a grade? Yes. But we need to check the marginal effect to say exactly how much. We’ll use get_margeff to do this, we’d like the \(dy/dx\) effect, and we’ll take it at the mean of each regressor.

marg_effect = res_logit.get_margeff(at="mean", method="dydx")
marg_effect.summary()
Logit Marginal Effects
Dep. Variable: GRADE
Method: dydx
At: mean
dy/dx std err z P>|z| [0.025 0.975]
GPA 0.5339 0.237 2.252 0.024 0.069 0.998
TUCE 0.0180 0.026 0.685 0.493 -0.033 0.069
PSI 0.4493 0.197 2.284 0.022 0.064 0.835

So participation gives almost half a grade increase.

Probit#

Probit is very similar to logit: it’s a statistical method for a best-fit line between regressors \(X\) and an outcome varibale \(y\) that takes on values in \((0, 1)\). And, just like with logit, the function that we’re assuming links the regressors and the outcome has a few different names!

The data generating process is assumed to be

\[ {\displaystyle \mathbb{P}(Y=1\mid X)=\Phi (X^{T}\beta )} \]

where

\[ {\displaystyle \Phi (x)={\frac {1}{\sqrt {2\pi }}}\int _{-\infty }^{x}e^{-{\frac {y^{2}}{2}}}dy.} \]

is the cumulative standard normal (aka Gaussian) distribution. The coefficients from a probit model do not have the same interpration as in an OLS estimation, and you can see this from the fact that \(\partial y/\partial x_i \neq \beta_i\) for probit. And, just as with logit, although you can derive the marginal effects, most packages offer a convenient way to quickly recover them.

We can re-use our previous example of predicting a target GRADE, representing whether a grade improved or not, based on some regressors including participation (PSI) in a programme.

res_probit = smf.probit("GRADE ~ GPA + TUCE + PSI", data=df).fit()
print(res_probit.summary())
Optimization terminated successfully.
         Current function value: 0.400588
         Iterations 6
                          Probit Regression Results                           
==============================================================================
Dep. Variable:                  GRADE   No. Observations:                   32
Model:                         Probit   Df Residuals:                       28
Method:                           MLE   Df Model:                            3
Date:                Wed, 02 Nov 2022   Pseudo R-squ.:                  0.3775
Time:                        23:48:44   Log-Likelihood:                -12.819
converged:                       True   LL-Null:                       -20.592
Covariance Type:            nonrobust   LLR p-value:                  0.001405
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     -7.4523      2.542     -2.931      0.003     -12.435      -2.469
GPA            1.6258      0.694      2.343      0.019       0.266       2.986
TUCE           0.0517      0.084      0.617      0.537      -0.113       0.216
PSI            1.4263      0.595      2.397      0.017       0.260       2.593
==============================================================================
p_marg_effect = res_probit.get_margeff(at="mean", method="dydx")
p_marg_effect.summary()
Probit Marginal Effects
Dep. Variable: GRADE
Method: dydx
At: mean
dy/dx std err z P>|z| [0.025 0.975]
GPA 0.5333 0.232 2.294 0.022 0.078 0.989
TUCE 0.0170 0.027 0.626 0.531 -0.036 0.070
PSI 0.4679 0.188 2.494 0.013 0.100 0.836

It’s no coincidence that we find very similar results here because the two functions we’re using don’t actually look all that different:

import scipy.stats as st

fig, ax = plt.subplots()
support = np.linspace(-6, 6, 1000)
ax.plot(support, st.logistic.cdf(support), ls="--", label="Logistic")
ax.plot(support, st.norm.cdf(support), label="Probit")
ax.legend()
ax.set_ylim(0, None)
ax.set_ylim(0, None)
plt.show()
_images/econmt-regression_101_0.svg

What difference there is, is that logistic regression puts more weight into the tails of the distribution. Arguably, logit is easier to interpret too. With logistic regression, a one unit change in \(x_i\) is associated with a \(\beta_i\) change in the log odds of a 1 outcome or, alternatively, an \(e^{\beta_i}\)-fold change in the odds, all else being equal. With a probit, this is a change of \(\beta_i z\) for \(z\) a normalised variable that you’d have to convert into a predicted probability using the normal CDF.

Generalised linear models#

Logit and probit (and OLS for that matter) as special cases of a class of models such that \(g\) is a ‘link’ function connects a function of regressors to the output, and \(\mu\) is the mean of a conditional response distribution at a given point in the space of regressors. When \(g(\mu) = X'\beta\), we just get regular OLS. When it’s logit, we have

\[ {\displaystyle \mu= \mathbb{E}(Y\mid X=x) =g^{-1}(X'\beta)= \frac{1}{1 + e^{-X'\beta}}.} \]

But as well as the ones we’ve seen, there are many possible link functions one can use via the catch-all glm function. These come in different ‘families’ of distributions, with the default for the binomial family being logit. So, running smf.glm('GRADE ~ GPA + TUCE + PSI', data=df, family=sm.families.Binomial()).fit() will produce exactly the same as we got both using the logit function. For more on the families of distributions and possible link functions, see the relevant part of the statsmodels documentation.

If you need a library dedicated to GLMs that has all the bells and whistles you can dream of, you might want to check out glum. At the time of writing, it is faster than either GLMnet or H2O (two other popular GLM libraries).

Linear Mixed Effects Regressions (LMER)#

Linear mixed models are an extension of simple linear models to allow both fixed and random effects.

We’ll be using the statsmodels package to demonstrate this, along with the dietox dataset, which has data on the growth curves of pigs. The data are taken from a study of the “Influence of Dietary Rapeseed Oli, Vitamin E, and Copper on Performance and Antioxidant and Oxidative Status of Pigs” Lauridsen et al. [1999].

data = sm.datasets.get_rdataset('dietox', 'geepack').data
data.head()
Pig Evit Cu Litter Start Weight Feed Time
0 4601 Evit000 Cu000 1 26.5 26.50000 NaN 1
1 4601 Evit000 Cu000 1 26.5 27.59999 5.200005 2
2 4601 Evit000 Cu000 1 26.5 36.50000 17.600000 3
3 4601 Evit000 Cu000 1 26.5 40.29999 28.500000 4
4 4601 Evit000 Cu000 1 26.5 49.09998 45.200001 5

Now we’re going to regress the weight of the pigs on the time and a couple of random effects.

We fit two random effects for each pig: a random intercept, and a random slope (with respect to time). This means that each pig may have a different baseline weight, as well as grow at a different rate.

md = smf.mixedlm("Weight ~ Time", data, groups=data["Pig"], re_formula="~Time")
mdf = md.fit(method=["lbfgs"])
mdf.summary()
Model: MixedLM Dependent Variable: Weight
No. Observations: 861 Method: REML
No. Groups: 72 Scale: 6.0372
Min. group size: 11 Log-Likelihood: -2217.0475
Max. group size: 12 Converged: Yes
Mean group size: 12.0
Coef. Std.Err. z P>|z| [0.025 0.975]
Intercept 15.739 0.550 28.603 0.000 14.660 16.817
Time 6.939 0.080 86.925 0.000 6.783 7.095
Group Var 19.503 1.561
Group x Time Cov 0.294 0.153
Time Var 0.416 0.033

This is saying that we have 72 pigs with 861 observations between them. The “Group Var” here is the random effect intercept for pigs, and it has a variance of 19.5. Meanwhile the slope random effect has a variance of 0.416. In addition to these, there are the two fixed effects. It’s easier to see where all of these estimates come out by plotting them. We’ll be honest, plotting the estimates hasn’t been made easy—so getting the data out and configuring it isn’t as easy as it could be.

First we pop the random effects results into a dataframe

group_time_ests = pd.DataFrame.from_dict(mdf.random_effects).T
group_time_ests["position_group"] = "Group"
group_time_ests["position_time"] = "Time"
group_time_ests["Group"] = group_time_ests["Group"] + mdf.fe_params["Intercept"]
group_time_ests["Time"] = group_time_ests["Time"] + mdf.fe_params["Time"]
group_time_ests.head()
Group Time position_group position_time
4601 15.109694 6.848209 Group Time
4602 16.133875 7.364919 Group Time
4603 17.846672 7.235199 Group Time
4605 19.771567 7.374041 Group Time
4641 17.265125 8.143678 Group Time

Now let’s plot the data:

import seaborn.objects as so

details_dict = {"Intercept": 0, "Time": 1}

fig, ax = plt.subplots()
ax.set_title("Random effects: Pig model")

for key, value in details_dict.items():
    ax.scatter(mdf.fe_params[key], value, s=50, color="black", edgecolor="k", zorder=3)
    ax.errorbar(mdf.fe_params[key], value, xerr=mdf.conf_int().loc[key].diff().dropna(), ls='', lw=4, color="k")

sns_plot = (
    so.Plot(group_time_ests, "Group", "position_group")
    .add(so.Dots(), so.Jitter())
    .add(so.Dots(), so.Jitter(), x="Time", y="position_time")
    .label(x="Estimate", y="Value")
    .on(ax)
    .plot()
)
_images/econmt-regression_111_0.svg

Linear probability model#

When \(y\) takes values in \(\{0, 1\}\) but the model looks like

\[ y = x' \cdot \beta \]

and is estimated by OLS then you have a linear probability model. In this case, the interpretion of a unit change in \(x_i\) is that it induces a \(\beta_i\) change in probability of \(y\). Note that homoskedasticity does not hold for the linear probability model.

Violations of the classical linear model (CLM)#

Heteroskedasticity#

If an estimated model is homoskedastic then its random variables have equal (finite) variance. This is also known as homogeneity of variance. Another way of putting it is that, for all observations \(i\) in an estimated model \(y_i = X_i\hat{\beta} + \epsilon_i\) then

\[ \mathbb{E}(\epsilon_i \epsilon_i) = \sigma^2 \]

When this relationship does not hold, an estimated model is said to be heteroskedastic.

To test for heteroskedasticity, you can use statsmodels’ versions of the Breusch-Pagan or White tests with the null hypothesis that the estimated model is homoskedastic. If the null hypothesis is rejected, then standard errors, t-statistics, and F-statistics are invalidated. In this case, you will need HAC (heteroskedasticity and auto-correlation consistent) standard errors, t- and F-statistics.

To obtain HAC standard errors from existing regression results in a variable results, you can use (for 1 lag):

results.get_robustcov_results('HAC', maxlags=1).summary()

Quantile regression#

Quantile regression estimates the conditional quantiles of a response variable. In some cases, it can be more robust to outliers and, in the case of the \(q=0.5\) quantile it is equivalent LAD (Least Absolute Deviation) regression. Let’s look at an example of quantile regression in action, lifted direct from the statsmodels documentation and based on a Journal of Economic Perspectives paper by Koenker and Hallock.

df = sm.datasets.engel.load_pandas().data
df.head()
income foodexp
0 420.157651 255.839425
1 541.411707 310.958667
2 901.157457 485.680014
3 639.080229 402.997356
4 750.875606 495.560775

What we have here are two sets of related data. Let’s perform several quantile regressions from 0.1 to 0.9 in steps of 0.1

mod = smf.quantreg("foodexp ~ income", df)
quantiles = np.arange(0.1, 1.0, 0.1)
q_results = [mod.fit(q=x) for x in quantiles]

The \(q=0.5\) entry will be at the 4 index; let’s take a look at it:

print(q_results[4].summary())
                         QuantReg Regression Results                          
==============================================================================
Dep. Variable:                foodexp   Pseudo R-squared:               0.6206
Model:                       QuantReg   Bandwidth:                       64.51
Method:                 Least Squares   Sparsity:                        209.3
Date:                Wed, 02 Nov 2022   No. Observations:                  235
Time:                        23:48:46   Df Residuals:                      233
                                        Df Model:                            1
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     81.4823     14.634      5.568      0.000      52.649     110.315
income         0.5602      0.013     42.516      0.000       0.534       0.586
==============================================================================

The condition number is large, 2.38e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Let’s take a look at the results for all of the regressions and let’s add in OLS for comparison:

ols_res = smf.ols("foodexp ~ income", df).fit()

get_y = lambda a, b: a + b * x
x = np.arange(df.income.min(), df.income.max(), 50)
# Just to make the plot clearer
x_max = 3000
x = x[x < x_max]

fig, ax = plt.subplots(figsize=(8, 6))
df.plot.scatter(
    ax=ax, x="income", y="foodexp", alpha=0.7, s=10, zorder=2, edgecolor=None
)
for i, res in enumerate(q_results):
    y = get_y(res.params["Intercept"], res.params["income"])
    ax.plot(x, y, color="grey", lw=0.5, zorder=0, linestyle=(0, (5, 10)))
    ax.annotate(f"$q={quantiles[i]:1.1f}$", xy=(x.max(), y.max()))
y = get_y(ols_res.params["Intercept"], ols_res.params["income"])
ax.plot(x, y, color="red", label="OLS", zorder=0)
ax.legend()
ax.set_xlim(0, x_max)
plt.show()
_images/econmt-regression_121_0.svg

This chart shows very clearly how quantile regression differs from OLS. The line fitted by OLS is trying to be all things to all points whereas the line fitted by quantile regression is focused only on its quantile. You can also see how points far from the median (not all shown) may be having a large influence on the OLS line.

Rolling and recursive regressions#

Rolling ordinary least squares applies OLS (ordinary least squares) across a fixed window of observations and then rolls (moves or slides) that window across the data set. They key parameter is window, which determines the number of observations used in each OLS regression. Recursive regression is equivalent to rolling regression but with a window that expands over time.

Let’s first create some synthetic data to perform estimation on:

from statsmodels.regression.rolling import RollingOLS
import statsmodels.api as sm
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=200, n_features=2, random_state=0, noise=4.0, bias=0)
df = pd.DataFrame(X).rename(columns={0: "feature0", 1: "feature1"})
df["target"] = y
df.head()
feature0 feature1 target
0 -0.955945 -0.345982 -36.740556
1 -1.225436 0.844363 7.190031
2 -0.692050 1.536377 44.389018
3 0.010500 1.785870 57.019515
4 -0.895467 0.386902 -16.088554

Now let’s fit the model using a formula and a window of 25 steps.

roll_reg = RollingOLS.from_formula(
    "target ~ feature0 + feature1 -1", window=25, data=df
)
model = roll_reg.fit()

Note that -1 in the formala suppresses the intercept. We can see the parameters using model.params. Here are the params for time steps between 20 and 30:

model.params[20:30]
feature0 feature1
20 NaN NaN
21 NaN NaN
22 NaN NaN
... ... ...
27 20.532655 34.919468
28 20.470171 35.365235
29 20.002261 35.666997

10 rows × 2 columns

Note that there aren’t parameters for entries between 0 and 23 because our window is 25 steps wide. We can easily look at how any of the coefficients are changing over time. Here’s an example for ‘feature0’.

fig = model.plot_recursive_coefficient(variables=["feature0"])
plt.xlabel("Time step")
plt.ylabel("Coefficient value")
plt.show()
_images/econmt-regression_130_0.svg

A rolling regression with an expanding rather than moving window is effectively a recursive least squares model. We can do this instead using the RecursiveLS function from statsmodels. Let’s fit this to the whole dataset:

reg_rls = sm.RecursiveLS.from_formula("target ~ feature0 + feature1 -1", df)
model_rls = reg_rls.fit()
print(model_rls.summary())
                           Statespace Model Results                           
==============================================================================
Dep. Variable:                 target   No. Observations:                  200
Model:                    RecursiveLS   Log Likelihood                -570.923
Date:                Wed, 02 Nov 2022   R-squared:                       0.988
Time:                        23:48:47   AIC                           1145.847
Sample:                             0   BIC                           1152.444
                                - 200   HQIC                          1148.516
Covariance Type:            nonrobust   Scale                           17.413
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
feature0      20.6872      0.296     69.927      0.000      20.107      21.267
feature1      34.0655      0.302    112.870      0.000      33.474      34.657
===================================================================================
Ljung-Box (L1) (Q):                   2.02   Jarque-Bera (JB):                 3.93
Prob(Q):                              0.16   Prob(JB):                         0.14
Heteroskedasticity (H):               1.17   Skew:                            -0.31
Prob(H) (two-sided):                  0.51   Kurtosis:                         3.31
===================================================================================

Warnings:
[1] Parameters and covariance matrix estimates are RLS estimates conditional on the entire sample.

But now we can look back at how the values of the coefficients changed over time too:

fig = model_rls.plot_recursive_coefficient(
    range(reg_rls.k_exog), legend_loc="upper right"
)
ax_list = fig.axes
for ax in ax_list:
    ax.set_xlim(0, None)
ax_list[-1].set_xlabel("Time step")
ax_list[0].set_title("Coefficient value");
_images/econmt-regression_134_0.svg

Regression plots#

statsmodels has a number of built-in plotting methods to help you understand how well your regression is capturing the relationships you’re looking for. Let’s see a few examples of these using statsmodels built-in Statewide Crime data set:

crime_data = sm.datasets.statecrime.load_pandas()
print(sm.datasets.statecrime.NOTE)
::

    Number of observations: 51
    Number of variables: 8
    Variable name definitions:

    state
        All 50 states plus DC.
    violent
        Rate of violent crimes / 100,000 population. Includes murder, forcible
        rape, robbery, and aggravated assault. Numbers for Illinois and
        Minnesota do not include forcible rapes. Footnote included with the
        American Statistical Abstract table reads:
        "The data collection methodology for the offense of forcible
        rape used by the Illinois and the Minnesota state Uniform Crime
        Reporting (UCR) Programs (with the exception of Rockford, Illinois,
        and Minneapolis and St. Paul, Minnesota) does not comply with
        national UCR guidelines. Consequently, their state figures for
        forcible rape and violent crime (of which forcible rape is a part)
        are not published in this table."
    murder
        Rate of murders / 100,000 population.
    hs_grad
        Percent of population having graduated from high school or higher.
    poverty
        % of individuals below the poverty line
    white
        Percent of population that is one race - white only. From 2009 American
        Community Survey
    single
        Calculated from 2009 1-year American Community Survey obtained obtained
        from Census. Variable is Male householder, no wife present, family
        household combined with Female householder, no husband present, family
        household, divided by the total number of Family households.
    urban
        % of population in Urbanized Areas as of 2010 Census. Urbanized
        Areas are area of 50,000 or more people.

First, let’s look at a Q-Q plot to get a sense of how the variables are distributed. This uses scipy’s stats module. The default distribution is normal but you can use any that scipy supports.

st.probplot(crime_data.data["murder"], dist="norm", plot=plt);
_images/econmt-regression_138_0.svg

Clearly, this is not quite normal and there are some serious outliers in the tails.

Let’s run take a look at the unconditional relationship we’re interested in: how murder depends on high school graduation. We’ll use plotnine’s geom_smooth to do this but bear in mind it will only run a linear model of 'murder ~ hs_grad' and ignore the other covariates.

from plotnine import *

(
    ggplot(crime_data.data, aes(y="murder", x="hs_grad"))
    + geom_point()
    + geom_smooth(method="lm")
)
_images/econmt-regression_140_0.svg
<ggplot: (8771078653543)>

We can take into account those other factors by using a partial regression plot that asks what does \(\mathbb{E}(y|X)\) look like as a function of \(\mathbb{E}(x_i|X)\)? (Use obs_labels=False to remove data point labels.)

with plt.rc_context({"font.size": 5}):
    sm.graphics.plot_partregress(
        endog="murder",
        exog_i="hs_grad",
        exog_others=["urban", "poverty", "single"],
        data=crime_data.data,
        obs_labels=True,
    )
    plt.show()
eval_env: 1
_images/econmt-regression_142_1.svg

At this point, the results of the regression are useful context.

results_crime = smf.ols(
    "murder ~ hs_grad + urban + poverty + single", data=crime_data.data
).fit()
print(results_crime.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Model:                            OLS   Adj. R-squared:                  0.797
Method:                 Least Squares   F-statistic:                     50.08
Date:                Wed, 02 Nov 2022   Prob (F-statistic):           3.42e-16
Time:                        23:48:49   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -44.1024     12.086     -3.649      0.001     -68.430     -19.774
hs_grad        0.3059      0.117      2.611      0.012       0.070       0.542
urban          0.0109      0.015      0.707      0.483      -0.020       0.042
poverty        0.4121      0.140      2.939      0.005       0.130       0.694
single         0.6374      0.070      9.065      0.000       0.496       0.779
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Putting the multicollinearity problems to one side, we see that the relationship shown in the partial regression plot is also implied by the coefficient on hs_grad in the regression table.

We can also look at an in-depth summary of one exogenous regressor and its relationship to the outcome variable. Each of these types of regression diagnostic are available individually, or for all regressors at once, too. The first panel is the chart we did with plotnine rendered differently (and, one could argue, more informatively). Most of the plots below are self-explanatory except for the third one, the CCPR (Component-Component plus Residual) plot. This provides a way to judge the effect of one regressor on the response variable by taking into account the effects of the other independent variables.

fig = plt.figure(figsize=(8, 6), dpi=150)

sm.graphics.plot_regress_exog(results_crime, "hs_grad", fig=fig)
plt.tight_layout()
plt.show()
eval_env: 1
_images/econmt-regression_147_1.svg

statsmodels can also produce influence plots of the ‘externally studentised’ residuals vs. the leverage of each observation as measured by the so-called hat matrix \(X(X^{\;\prime}X)^{-1}X^{\;\prime}\) (because it puts the ‘hat’ on \(y\)). Externally studentised residuals are residuals that are scaled by their standard deviation. High leverage points could exert an undue influence over the regression line, but only if the predicted \(y\) values of a regression that was fit with them excluded was quite different. In the example below, DC is having a big influence.

with plt.rc_context({"font.size": 6}):
    sm.graphics.influence_plot(results_crime)
_images/econmt-regression_150_0.svg

Finally, it’s nice to be able to see plots of our coefficients along with their standard errors. There isn’t a built-in statsmodels option for this, but happily it’s easy to extract the results of regressions in a sensible format. Using the results object from earlier, and excluding the intercept, we can get the coefficients from results.params[1:] and the associated errors from results.bse[1:].

# Put the results into a dataframe with Name, Coefficient, Error
res_df = (
    pd.concat([results_crime.params[1:], results_crime.bse[1:]], axis=1)
    .reset_index()
    .rename(columns={"index": "Name", 0: "Coefficient", 1: "Error"})
)
# Plot the coefficient values and their errors
(
    ggplot(res_df)
    + geom_point(aes("Name", "Coefficient"))
    + geom_errorbar(aes(x="Name", ymin="Coefficient-Error", ymax="Coefficient+Error"))
)
_images/econmt-regression_152_0.svg
<ggplot: (8771081544138)>

Binscatter#

We’re going to follow the excellent example in this blog post by Matteo Courthoud.

First we’ll generate some fake data:

def dgp_marketplace(N=10_000):
    # Does the firm sells only online?
    online = prng.binomial(1, 0.5, N)
    # How many products does the firm have
    products = 1 + prng.poisson(1, N)
    # What is the age of the firm
    t = prng.exponential(0.5*products, N) 
    # Sales
    sales = 1e3 * prng.exponential(products + np.maximum((1 + 0.3*products + 4*online)*t - 0.5*(1 + 6*online)*t**2, 0), N)
    # Generate the dataframe
    df = pd.DataFrame({'age': t, 'sales': sales, 'online': online, 'products': products})
    #df["online"] = df["online"].astype("category")
    return df
sales = dgp_marketplace(N=10_000)
sales.head()
age sales online products
0 2.507089 1383.896707 0 2
1 0.074963 2738.994208 1 2
2 0.259299 10876.933496 0 6
3 1.626234 12159.344328 0 2
4 0.026536 1584.629056 0 3

We have pulled down information on 10,000 firms. For each firm we know:

  • age: the age of the firm

  • sales: the monthly sales from last month

  • online: whether the firm is only active online

  • products: the number of products that the firm offers

Suppose we are interested in understanding the relationship between age and sales. What is the life-cycle of sales?

Let’s start with a simple scatterplot of sales vs age.

import seaborn.objects as so

(
    so.Plot(sales, x="age", y="sales")
    .add(so.Dot(edgecolor="k"))
)
_images/econmt-regression_157_0.png

This plot is far too crowded to get a real sense of the relationship here.

A linear regression can help us unpick what the relationship really is

smf.ols('np.log(sales) ~ np.log(age)', sales).fit().summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 7.3964 0.015 488.757 0.000 7.367 7.426
np.log(age) 0.1503 0.010 15.588 0.000 0.131 0.169

There is a pretty strong positive relationship here that wasn’t evident in the scatter plot. BUT it’s possible that this relationships depends on whether the firms are online-only or not. We can condition on the other variables and do another regression:

smf.ols('np.log(sales) ~ np.log(age) + C(online) + products', sales).fit().summary().tables[1]
coef std err t P>|t| [0.025 0.975]
Intercept 6.5413 0.036 181.200 0.000 6.471 6.612
C(online)[T.1] 0.1798 0.026 6.888 0.000 0.129 0.231
np.log(age) 0.0642 0.010 6.456 0.000 0.045 0.084
products 0.3521 0.014 25.191 0.000 0.325 0.379

The coefficient now looks very different having conditioned on the other confounders. We can see that sales still increase with age, but we don’t know if this relationship is linear or not. We could add extra terms (like age squared) to explore this.

Or, we could use a binned scatterplot to analyse the relationship non-parametrically. The binned scatterplot divides the conditioning variable, age in our example, into equally sized bins or quantiles, and then plots the conditional mean of the dependent variable, sales in our example, within each bin.

We’ll be using the binsreg package for this Cattaneo et al. [2019].

Binned scatterplots do not just provide conditional means for optimally chosen intervals; they can also provide inference for these means. In particular, they will give confidence intervals around each data point. In the binsreg package, the ci option adds confidence intervals to the estimation results.

As the binsreg package is not written in the most Pythonic way, we’ll adorn it a bit to make it more easily used with dataframes.

import binsreg

def binscatter(**kwargs):
    # Estimate binsreg
    est = binsreg.binsreg(**kwargs)
    # Retrieve estimates
    df_est = pd.concat([d.dots for d in est.data_plot])
    df_est = df_est.rename(columns={'x': kwargs.get("x"), 'fit': kwargs.get("y")})
    # Add confidence intervals
    if "ci" in kwargs:
        df_est = pd.merge(df_est, pd.concat([d.ci for d in est.data_plot]))
        df_est = df_est.drop(columns=['x'])
        df_est['ci'] = df_est['ci_r'] - df_est['ci_l']
    # Rename groups
    if "by" in kwargs:
        df_est['group'] = df_est['group'].astype(kwargs.get("data")[kwargs.get("by")].dtype)
        df_est = df_est.rename(columns={'group': kwargs.get("by")})

    return df_est
# Estimate binsreg
br_est = binscatter(x='age', y='sales', data=sales, ci=(3,3))
br_est.head()
group age bin isknot mid sales ci_l ci_r ci
0 Full Sample 0.010859 0 0 0 1464.787623 1071.515840 1583.280880 511.765040
1 Full Sample 0.033589 1 0 0 1796.147300 1597.311734 2008.197828 410.886094
2 Full Sample 0.059490 2 0 0 1859.670036 1671.846295 2109.052975 437.206680
3 Full Sample 0.086708 3 0 0 1886.426934 1669.795790 2095.450639 425.654848
4 Full Sample 0.114412 4 0 0 1938.436076 1709.891878 2248.915920 539.024041

We can now plot the results

fig, ax = plt.subplots()

sns_plot = (
    so.Plot(br_est, x="age", y="sales")
    .add(so.Dots())
)
ax.errorbar('age', 'sales', yerr='ci', data=br_est, ls='', lw=2, alpha=0.2)
ax.set_title("Binscatter: sales as a function of age")
sns_plot.on(ax).show()
_images/econmt-regression_166_0.svg

We can now see that the relationship is non-linear with a sharp increase in sales at the beginning of the lifetime of a firm, followed by a plateau.

As noted, this relationship may be modified by other variables though. So let’s now condition on those.

To condition on products, we pass w=["products"] to the binscatter function.

# Estimate binsreg
br_est = binscatter(x='age', y='sales', w=['products'], data=sales, ci=(3,3))
br_est.head()
group age bin isknot mid sales ci_l ci_r ci
0 Full Sample 0.011594 0 0 0 1993.431855 1696.328657 2194.308173 497.979516
1 Full Sample 0.036091 1 0 0 2205.494696 2020.644696 2413.209800 392.565104
2 Full Sample 0.063776 2 0 0 2456.340262 2223.387911 2629.754210 406.366299
3 Full Sample 0.093066 3 0 0 2346.037534 2173.715566 2574.592536 400.876970
4 Full Sample 0.122806 4 0 0 2425.527896 2160.148380 2686.659801 526.511421

Now let’s look at the plot of this:

fig, ax = plt.subplots()

sns_plot = (
    so.Plot(br_est, x="age", y="sales")
    .add(so.Dots())
)
ax.errorbar('age', 'sales', yerr='ci', data=br_est, ls='', lw=2, alpha=0.2)
ax.set_title("Binscatter: sales as a function of age conditioned on products")
sns_plot.on(ax).show()
_images/econmt-regression_170_0.svg

Conditional on the number of products, the shape of the sales life-cycle changes further. Now, after an initial increase in sales, we observe a gradual decrease over time.

Do online-only firms have different sales life-cycles with respect to mixed online-offline firms? We can produce different binned scatterplots by group using the option by.

# Estimate binsreg
br_est = binscatter(x='age', y='sales', w=['products'], by="online", data=sales, ci=(3,3))
br_est["online"] = br_est["online"].astype("category")
# plot the two groups
fig, ax = plt.subplots(figsize=(7, 5))

sns_plot = (
    so.Plot(br_est, x="age", y="sales", color="online")
    .add(so.Dots(pointsize=7))
)
for value in br_est["online"].unique():
    ax.errorbar('age', 'sales', yerr='ci', data=br_est.loc[br_est["online"]==value], ls='', lw=2, alpha=0.2)
ax.set_title("Binscatter: sales as a function of age conditioned on products", size=16)
sns_plot.on(ax).show()
_images/econmt-regression_172_0.svg

From the binned scatterplot, we can see that online products have on average shorter lifetimes, with a higher initial peak in sales, followed by a sharper decline.

Hopefully this shows you how you can use binsreg to get a better understanding of causal relationships in your data.

You can find out more about bin scatters in this video by Paul Goldsmith-Pinkham.

Specification curve analysis#

When specifying a model, modellers have many options. These can be informed by field intelligence, priors, and even misguided attempts to find a significant result. Even with the best of intentions, research teams can reach entirely different conclusions using the same, or similar, data because of different choices made in preparing data or in modelling it.

There’s formal evidence that researchers really do make different decisions; one study Silberzahn et al. [2018] gave the same research question - whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players - to 29 different teams. From the abstract of that paper:

Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability.

So not only were different decisions made, there seems to be no clearly identifiable reason for them. There is usually scope for reasonable alternative model specifications when estimating coefficients, and those coefficients will vary with those specifications.

Specification curve analysis Simonsohn et al. [2020] looks for a more exhaustive way of trying out alternative specifications. The three steps of specification curve analysis are:

  1. identifying the set of theoretically justified, statistically valid, and non-redundant analytic specifications;

  2. displaying alternative results graphically, allowing the identification of decisions producing different results; and

  3. conducting statistical tests to determine whether as a whole results are inconsistent with the null hypothesis.

For a good example of specification curve analysis in action, see this recent Nature Human Behaviour paper Orben and Przybylski [2019] on the association between adolescent well-being and the use of digital technology.

We’ll use the specification curve analysis package to do the first two, which you can install with pip install specification_curve. To demonstrate the full functionality, we’ll create a second, alternative ‘hp’ that is a transformed version of the original.

mpg["hp_boxcox"], _ = st.boxcox(mpg["hp"])

Now let’s create a specification curve. We need to specify the data, the different outcome variables we’d like to try, y_endog; the different possible versions of the main regressor of interest, x_exog; the possible controls, controls; any controls that should always be included, always_include; and any categorical variables to include class-by-class, cat_expand. Some of these accept lists of variables as well as single reggressors. The point estimates that have confidence intervals which include zero are coloured in grey, instead of blue. There is also an exclu_grps option to exclude certain combinations of regressors, and you can pass alternative estimators to fit, for example fit(estimator=sm.Logit).

import specification_curve as specy

sc = specy.SpecificationCurve(
    mpg,
    y_endog="mpg",
    x_exog=["lnhp", "hp_boxcox"],
    controls=["drat", "qsec", "cyl", "gear"],
    always_include=["gear"],
    cat_expand="cyl",
)
sc.fit()
sc.plot()
Fit complete
_images/econmt-regression_177_1.svg

Review#

In this very short introduction to regression with code, you should have learned how to:

  • ✅ perform linear OLS regressions with code;

  • ✅ add fixed effects/categorical variables to regressions;

  • ✅ use different standard errors;

  • ✅ use models with transformed regressors;

  • ✅ use the formula or array APIs for statsmodels and linearmodels;

  • ✅ show the results from multiple models;

  • ✅ perform IV regressions;

  • ✅ perform GLM regressions; and

  • ✅ use plots as a way to interrogate regression results.