Regression diagnostics and visualisations#
Introduction#
In this chapter, you’ll see some diagnostics and visualisations that are closely tied to regressions.
Most of this chapter relies on the amazing statsmodels package, including some of its examples. Some of the material in this chapter follows Grant McDermott’s excellent notes and the Library of Statistical Techniques (LOST).
Imports#
Let’s import some of the packages we’ll be using:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lets_plot import *
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as st
LetsPlot.setup_html()
# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator
Regression plots#
statsmodels has a number of built-in plotting methods to help you understand how well your regression is capturing the relationships you’re looking for. Let’s see a few examples of these using statsmodels built-in Statewide Crime data set:
crime_data = sm.datasets.statecrime.load_pandas()
print(sm.datasets.statecrime.NOTE)
::
Number of observations: 51
Number of variables: 8
Variable name definitions:
state
All 50 states plus DC.
violent
Rate of violent crimes / 100,000 population. Includes murder, forcible
rape, robbery, and aggravated assault. Numbers for Illinois and
Minnesota do not include forcible rapes. Footnote included with the
American Statistical Abstract table reads:
"The data collection methodology for the offense of forcible
rape used by the Illinois and the Minnesota state Uniform Crime
Reporting (UCR) Programs (with the exception of Rockford, Illinois,
and Minneapolis and St. Paul, Minnesota) does not comply with
national UCR guidelines. Consequently, their state figures for
forcible rape and violent crime (of which forcible rape is a part)
are not published in this table."
murder
Rate of murders / 100,000 population.
hs_grad
Percent of population having graduated from high school or higher.
poverty
% of individuals below the poverty line
white
Percent of population that is one race - white only. From 2009 American
Community Survey
single
Calculated from 2009 1-year American Community Survey obtained obtained
from Census. Variable is Male householder, no wife present, family
household combined with Female householder, no husband present, family
household, divided by the total number of Family households.
urban
% of population in Urbanized Areas as of 2010 Census. Urbanized
Areas are area of 50,000 or more people.
First, let’s look at a Q-Q plot to get a sense of how the variables are distributed. This uses scipy’s stats module. The default distribution is normal but you can use any that scipy supports.
st.probplot(crime_data.data["murder"], dist="norm", plot=plt);
Clearly, this is not quite normal and there are some serious outliers in the tails.
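Given those fat tails, you might want to compare against a heavier-tailed distribution instead of the normal. As a quick sketch, here’s the same probability plot against a t-distribution with 3 degrees of freedom (the degrees of freedom are an arbitrary choice for illustration):
# Probability plot against a t-distribution; sparams passes the distribution's shape parameter(s)
st.probplot(crime_data.data["murder"], dist="t", sparams=(3,), plot=plt);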
Let’s take a look at the unconditional relationship we’re interested in: how murder depends on high school graduation. We’ll use lets-plot’s geom_smooth() to do this, but bear in mind that it only runs a linear model of 'murder ~ hs_grad' and ignores the other covariates.
p = (
ggplot(crime_data.data, aes(y="murder", x="hs_grad"))
+ geom_point()
+ geom_smooth(method="lm")
)
p.show()
We can take those other factors into account by using a partial regression plot, which shows the relationship between \(y\) and \(x_i\) once the other regressors in \(X\) have been partialled out of both. (Use obs_labels=False to remove data point labels.)
with plt.rc_context({"font.size": 5}):
sm.graphics.plot_partregress(
endog="murder",
exog_i="hs_grad",
exog_others=["urban", "poverty", "single"],
data=crime_data.data,
obs_labels=True,
)
plt.show()
At this point, the results of the regression are useful context.
results_crime = smf.ols(
"murder ~ hs_grad + urban + poverty + single", data=crime_data.data
).fit()
print(results_crime.summary())
OLS Regression Results
==============================================================================
Dep. Variable: murder R-squared: 0.813
Model: OLS Adj. R-squared: 0.797
Method: Least Squares F-statistic: 50.08
Date: Fri, 05 Jan 2024 Prob (F-statistic): 3.42e-16
Time: 15:38:48 Log-Likelihood: -95.050
No. Observations: 51 AIC: 200.1
Df Residuals: 46 BIC: 209.8
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -44.1024 12.086 -3.649 0.001 -68.430 -19.774
hs_grad 0.3059 0.117 2.611 0.012 0.070 0.542
urban 0.0109 0.015 0.707 0.483 -0.020 0.042
poverty 0.4121 0.140 2.939 0.005 0.130 0.694
single 0.6374 0.070 9.065 0.000 0.496 0.779
==============================================================================
Omnibus: 1.618 Durbin-Watson: 2.507
Prob(Omnibus): 0.445 Jarque-Bera (JB): 0.831
Skew: -0.220 Prob(JB): 0.660
Kurtosis: 3.445 Cond. No. 5.80e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Putting the multicollinearity problems to one side, we see that the relationship shown in the partial regression plot is also implied by the coefficient on hs_grad in the regression table.
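If you do want to dig into that multicollinearity warning, one quick check is the variance inflation factor (VIF) of each regressor; values well above 5 or 10 are often taken as a warning sign. Here’s a minimal sketch using statsmodels:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of the design matrix (the first entry is the intercept)
X = results_crime.model.exog
vifs = pd.Series(
    [variance_inflation_factor(X, i) for i in range(X.shape[1])],
    index=results_crime.model.exog_names,
)
print(vifs)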
We can also look at an in-depth summary of one exogenous regressor and its relationship to the outcome variable. Each of these types of regression diagnostic is available individually, or for all regressors at once, too. The first panel is the chart we made with lets-plot, rendered differently (and, one could argue, more informatively). Most of the plots below are self-explanatory, except for the third one, the CCPR (Component-Component plus Residual) plot. This provides a way to judge the effect of one regressor on the response variable while taking into account the effects of the other independent variables.
fig = plt.figure(figsize=(8, 6), dpi=150)
sm.graphics.plot_regress_exog(results_crime, "hs_grad", fig=fig)
plt.show()
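As noted, these diagnostics are also available one at a time or for every regressor at once. For instance, here’s a sketch using the built-in CCPR helpers, first for a single regressor and then as a grid covering all regressors:
# CCPR plot for a single regressor...
sm.graphics.plot_ccpr(results_crime, "hs_grad")
plt.show()
# ...or a grid of CCPR plots, one panel per regressor
fig = plt.figure(figsize=(8, 6), dpi=150)
sm.graphics.plot_ccpr_grid(results_crime, fig=fig)
plt.show()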
statsmodels can also produce influence plots of the ‘externally studentised’ residuals vs. the leverage of each observation, as measured by the so-called hat matrix \(X(X^{\;\prime}X)^{-1}X^{\;\prime}\) (because it puts the ‘hat’ on \(y\)). Externally studentised residuals are residuals scaled by an estimate of their standard deviation that excludes the observation in question. High leverage points can exert an undue influence over the regression line, but only if refitting the regression without them would give quite different predicted \(y\) values. In the example below, DC is having a big influence.
with plt.rc_context({"font.size": 6}):
sm.graphics.influence_plot(results_crime)
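The numbers behind this plot are available too, via the fitted model’s get_influence() method; here’s a short sketch pulling out leverage, externally studentised residuals, and Cook’s distance:
influence = results_crime.get_influence()
# Diagonal of the hat matrix (leverage), externally studentised residuals,
# and Cook's distance (the first element of the returned tuple)
influence_df = pd.DataFrame(
    {
        "leverage": influence.hat_matrix_diag,
        "studentised_resid": influence.resid_studentized_external,
        "cooks_d": influence.cooks_distance[0],
    },
    index=crime_data.data.index,
)
print(influence_df.sort_values("cooks_d", ascending=False).head())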
Finally, it’s nice to be able to see plots of our coefficients along with their standard errors. There isn’t a built-in statsmodels option for this, but happily it’s easy to extract the results of regressions in a sensible format. Using the results_crime object from earlier, and excluding the intercept, we can get the coefficients from results_crime.params[1:] and the associated standard errors from results_crime.bse[1:].
# Put the results into a dataframe with Name, Coefficient, Error
res_df = (
pd.concat([results_crime.params[1:], results_crime.bse[1:]], axis=1)
.reset_index()
.rename(columns={"index": "Name", 0: "Coefficient", 1: "Error"})
)
res_df["ymax"] = res_df["Coefficient"] + res_df["Error"]
res_df["ymin"] = res_df["Coefficient"] - res_df["Error"]
# Plot the coefficient values and their errors
(
ggplot(res_df)
+ geom_point(aes("Name", "Coefficient"))
+ geom_errorbar(aes(x="Name", ymin="ymin", ymax="ymax"))
).show()
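Note that the error bars above span plus and minus one standard error. If you’d rather show 95 per cent confidence intervals, one option (sketched below) is to take them straight from the fitted model via conf_int():
# 95% confidence intervals from statsmodels, excluding the intercept
ci_df = results_crime.conf_int().iloc[1:].rename(columns={0: "ymin", 1: "ymax"})
ci_df["Coefficient"] = results_crime.params[1:]
ci_df["Name"] = ci_df.index
(
    ggplot(ci_df)
    + geom_point(aes("Name", "Coefficient"))
    + geom_errorbar(aes(x="Name", ymin="ymin", ymax="ymax"))
).show()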
Binscatter#
We’re going to follow the excellent example in this blog post by Matteo Courthoud.
First we’ll generate some fake data:
def dgp_marketplace(N=10_000):
    # Does the firm sell only online?
    online = prng.binomial(1, 0.5, N)
    # How many products does the firm have?
    products = 1 + prng.poisson(1, N)
    # What is the age of the firm?
    t = prng.exponential(0.5*products, N)
    # Sales
    sales = 1e3 * prng.exponential(products + np.maximum((1 + 0.3*products + 4*online)*t - 0.5*(1 + 6*online)*t**2, 0), N)
    # Generate the dataframe
    df = pd.DataFrame({'age': t, 'sales': sales, 'online': online, 'products': products})
    return df
sales = dgp_marketplace(N=10_000)
sales.head()
|   | age | sales | online | products |
|---|---|---|---|---|
| 0 | 0.215330 | 1476.840811 | 0 | 2 |
| 1 | 0.306930 | 1624.207274 | 1 | 1 |
| 2 | 0.791730 | 6721.358263 | 1 | 3 |
| 3 | 0.157610 | 5520.048141 | 0 | 1 |
| 4 | 1.316726 | 1916.905960 | 0 | 3 |
We have pulled down information on 10,000 firms. For each firm we know:

- age: the age of the firm
- sales: the monthly sales from last month
- online: whether the firm is only active online
- products: the number of products that the firm offers
Suppose we are interested in understanding the relationship between age and sales. What is the life-cycle of sales?
Let’s start with a simple scatterplot of sales vs age.
(
ggplot(sales, aes(x="age", y="sales")) +
geom_point(color="blue", alpha=0.4)
).show()
This plot is far too crowded to get a real sense of the relationship here.
A linear regression can help us unpick what the relationship really is:
smf.ols('np.log(sales) ~ np.log(age)', sales).fit().summary().tables[1]
|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 7.4012 | 0.015 | 487.440 | 0.000 | 7.371 | 7.431 |
| np.log(age) | 0.1531 | 0.010 | 15.777 | 0.000 | 0.134 | 0.172 |
There is a pretty strong positive relationship here that wasn’t evident in the scatter plot. But it’s possible that this relationship depends on whether the firms are online-only or not. We can condition on the other variables and run another regression:
smf.ols('np.log(sales) ~ np.log(age) + C(online) + products', sales).fit().summary().tables[1]
|   | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 6.5121 | 0.036 | 178.775 | 0.000 | 6.441 | 6.584 |
| C(online)[T.1] | 0.1877 | 0.026 | 7.186 | 0.000 | 0.137 | 0.239 |
| np.log(age) | 0.0645 | 0.010 | 6.477 | 0.000 | 0.045 | 0.084 |
| products | 0.3665 | 0.014 | 26.256 | 0.000 | 0.339 | 0.394 |
The coefficient on np.log(age) now looks very different, having conditioned on the other confounders. We can see that sales still increase with age, but we don’t know whether this relationship is linear or not. We could add extra terms (like age squared) to explore this, as in the sketch below.
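For example, here’s one possible specification (just a sketch; output not shown) that adds a quadratic term in log age using the I() operator in the formula:
# Allow for curvature in the age-sales relationship with a squared log(age) term
smf.ols(
    "np.log(sales) ~ np.log(age) + I(np.log(age)**2) + C(online) + products", sales
).fit().summary().tables[1]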
Or, we could use a binned scatterplot to analyse the relationship non-parametrically. The binned scatterplot divides the conditioning variable, age in our example, into equally sized bins or quantiles, and then plots the conditional mean of the dependent variable, sales in our example, within each bin.
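To make the idea concrete, here’s a rough ‘by hand’ version using pandas quantile bins. This is only a sketch of the concept: the binsreg package we use below chooses the bins optimally and provides proper inference.
# A manual binned scatter: quantile bins of age, mean of sales within each bin
n_bins = 20
manual_bins = (
    sales.assign(age_bin=pd.qcut(sales["age"], q=n_bins))
    .groupby("age_bin", observed=True)
    .agg(age=("age", "mean"), sales=("sales", "mean"))
    .reset_index(drop=True)
)
(
    ggplot(manual_bins, aes(x="age", y="sales"))
    + geom_point()
    + ggtitle("Manual binned scatter (quantile bins)")
).show()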
We’ll be using the binsreg package of Cattaneo et al. [2019] for this.
Binned scatterplots do not just provide conditional means for optimally chosen intervals; they can also provide inference for these means. In particular, they will give confidence intervals around each data point. In the binsreg package, the ci option adds confidence intervals to the estimation results.
As the binsreg package is not written in the most Pythonic way, we’ll wrap it in a small helper function to make it easier to use with dataframes.
import binsreg


def binscatter(**kwargs):
    # Estimate binsreg
    est = binsreg.binsreg(**kwargs, noplot=True)
    # Retrieve estimates
    df_est = pd.concat([d.dots for d in est.data_plot])
    df_est = df_est.rename(columns={'x': kwargs.get("x"), 'fit': kwargs.get("y")})
    # Add confidence intervals
    if "ci" in kwargs:
        df_est = pd.merge(df_est, pd.concat([d.ci for d in est.data_plot]))
        df_est = df_est.drop(columns=['x'])
        df_est['ci'] = df_est['ci_r'] - df_est['ci_l']
    # Rename groups
    if "by" in kwargs:
        df_est['group'] = df_est['group'].astype(kwargs.get("data")[kwargs.get("by")].dtype)
        df_est = df_est.rename(columns={'group': kwargs.get("by")})
    return df_est
# Estimate binsreg
br_est = binscatter(x='age', y='sales', data=sales, ci=(3,3))
br_est.head()
UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.
|   | group | age | bin | isknot | mid | sales | ci_l | ci_r | ci |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Full Sample | 0.014120 | 0 | 0 | 0 | 1562.374625 | 1245.889702 | 1689.715525 | 443.825823 |
| 1 | Full Sample | 0.044370 | 1 | 0 | 0 | 1740.650743 | 1576.219312 | 1960.659363 | 384.440051 |
| 2 | Full Sample | 0.079516 | 2 | 0 | 0 | 1792.476949 | 1576.668323 | 1952.172833 | 375.504510 |
| 3 | Full Sample | 0.116433 | 3 | 0 | 0 | 2064.777874 | 1916.875465 | 2413.991609 | 497.116144 |
| 4 | Full Sample | 0.153176 | 4 | 0 | 0 | 2428.623364 | 2080.920773 | 2569.592233 | 488.671459 |
We can now plot the results:
(
ggplot(br_est, aes(x="age", y="sales", ymin='ci_l', ymax="ci_r")) +
geom_point() +
geom_errorbar() +
ggtitle("Binscatter: sales as a function of age")
)
We can now see that the relationship is non-linear, with a sharp increase in sales at the beginning of the lifetime of a firm, followed by a plateau.
As noted, though, this relationship may be modified by other variables, so let’s now condition on those.

To condition on products, we pass w=["products"] to the binscatter() function.
# Estimate binsreg
br_est = binscatter(x='age', y='sales', w=['products'], data=sales, ci=(3,3))
br_est.head()
UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.
|   | group | age | bin | isknot | mid | sales | ci_l | ci_r | ci |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Full Sample | 0.015318 | 0 | 0 | 0 | 2076.023370 | 1855.541497 | 2288.822512 | 433.281014 |
| 1 | Full Sample | 0.048655 | 1 | 0 | 0 | 2224.250215 | 2031.657039 | 2405.704773 | 374.047734 |
| 2 | Full Sample | 0.086865 | 2 | 0 | 0 | 2348.099214 | 2151.597400 | 2521.641938 | 370.044537 |
| 3 | Full Sample | 0.127547 | 3 | 0 | 0 | 2641.615108 | 2415.134361 | 2873.678452 | 458.544092 |
| 4 | Full Sample | 0.166449 | 4 | 0 | 0 | 2595.750136 | 2513.130901 | 2964.502585 | 451.371684 |
Now let’s look at the plot of this:
(
ggplot(br_est, aes(x="age", y="sales", ymin='ci_l', ymax="ci_r")) +
geom_point() +
geom_errorbar() +
ggtitle("Binscatter: sales as a function of age conditioned on products")
)
Conditional on the number of products, the shape of the sales life-cycle changes further. Now, after an initial increase in sales, we observe a gradual decrease over time.

Do online-only firms have different sales life-cycles compared with mixed online-offline firms? We can produce separate binned scatterplots by group using the by option.
# Estimate binsreg
br_est = binscatter(x='age', y='sales', w=['products'], by="online", data=sales, ci=(3,3))
br_est["online"] = br_est["online"].astype("boolean")
br_est.head()
UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.
|   | online | age | bin | isknot | mid | sales | ci_l | ci_r | ci |
|---|---|---|---|---|---|---|---|---|---|
| 0 | False | 0.017810 | 0 | 0 | 0 | 2148.012043 | 1774.809400 | 2365.909544 | 591.100144 |
| 1 | False | 0.056041 | 1 | 0 | 0 | 2144.905095 | 1919.921103 | 2380.367142 | 460.446039 |
| 2 | False | 0.097980 | 2 | 0 | 0 | 2206.951051 | 2028.323642 | 2500.604182 | 472.280540 |
| 3 | False | 0.146331 | 3 | 0 | 0 | 2529.481251 | 2222.970487 | 2717.260673 | 494.290186 |
| 4 | False | 0.191721 | 4 | 0 | 0 | 2459.073079 | 2140.625594 | 2636.301404 | 495.675810 |
(
ggplot(br_est, aes(x="age", y="sales", color="online", ymin='ci_l', ymax="ci_r")) +
geom_point() +
geom_errorbar() +
ggtitle("Binscatter: sales as a function of age conditioned on products")
)
From the binned scatterplot, we can see that online-only firms have, on average, shorter sales life-cycles, with a higher initial peak in sales followed by a sharper decline.
Hopefully this shows you how you can use binsreg to get a better understanding of causal relationships in your data.
You can find out more about bin scatters in this video by Paul Goldsmith-Pinkham.
Specification curve analysis#
When specifying a model, modellers have many options. These can be informed by field intelligence, priors, and even misguided attempts to find a significant result. Even with the best of intentions, research teams can reach entirely different conclusions using the same, or similar, data because of different choices made in preparing data or in modelling it.
There’s formal evidence that researchers really do make different decisions; one study [Silberzahn, Uhlmann, Martin, Anselmi, Aust, Awtrey, Bahník, Bai, Bannard, Bonnier, and others, 2018] gave the same research question - whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players - to 29 different teams. From the abstract of that paper:
Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability.
So not only were different decisions made, but there also seems to be no clearly identifiable reason for them. There is usually scope for reasonable alternative model specifications when estimating coefficients, and those coefficients will vary with those specifications.
Specification curve analysis [Simonsohn, Simmons, and Nelson, 2020] looks for a more exhaustive way of trying out alternative specifications. The three steps of specification curve analysis are:
identifying the set of theoretically justified, statistically valid, and non-redundant analytic specifications;
displaying alternative results graphically, allowing the identification of decisions producing different results; and
conducting statistical tests to determine whether, taken as a whole, the results are inconsistent with the null hypothesis.
For a good example of specification curve analysis in action, see this recent Nature Human Behaviour paper [Orben and Przybylski, 2019] on the association between adolescent well-being and the use of digital technology.
We’ll use the specification curve analysis package to do the first two of these steps; you can install it with pip install specification_curve. To demonstrate the full functionality, we’ll use the “mpg” dataset and create a couple of alternative versions of ‘hp’ that are transformations of the original.
mpg = pd.read_csv(
"https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv",
dtype={"model": str, "mpg": float, "hp": float, "disp": float, "cyl": "category"},
)
mpg.head()
|   | model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110.0 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
| 1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110.0 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
| 2 | Datsun 710 | 22.8 | 4 | 108.0 | 93.0 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
| 3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110.0 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
| 4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175.0 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
mpg["hp_boxcox"], _ = st.boxcox(mpg["hp"])
mpg["lnhp"] = np.log(mpg["hp"])
mpg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 32 non-null object
1 mpg 32 non-null float64
2 cyl 32 non-null category
3 disp 32 non-null float64
4 hp 32 non-null float64
5 drat 32 non-null float64
6 wt 32 non-null float64
7 qsec 32 non-null float64
8 vs 32 non-null int64
9 am 32 non-null int64
10 gear 32 non-null int64
11 carb 32 non-null int64
12 hp_boxcox 32 non-null float64
13 lnhp 32 non-null float64
dtypes: category(1), float64(8), int64(4), object(1)
memory usage: 3.5+ KB
Now let’s create a specification curve. We need to specify the data; the different outcome variables we’d like to try, y_endog; the different possible versions of the main regressor of interest, x_exog; the possible controls, controls; any controls that should always be included, always_include; and any categorical variables to include class-by-class, cat_expand. Some of these accept lists of variables as well as single regressors. The point estimates that have confidence intervals which include zero are coloured in grey instead of blue. There is also an exclu_grps option to exclude certain combinations of regressors, and you can pass alternative estimators to fit(), for example fit(estimator=sm.Logit).
import specification_curve as specy
sc = specy.SpecificationCurve(
mpg,
y_endog="mpg",
x_exog=["lnhp", "hp_boxcox"],
controls=["drat", "qsec", "cyl", "gear"],
always_include=["gear"],
cat_expand="cyl",
)
sc.fit()
sc.plot()
Fit complete
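As an example of the exclu_grps option mentioned above, here’s a sketch that rules out specifications including both drat and qsec together. We’re assuming here that exclu_grps accepts a list of lists of controls that shouldn’t appear in the same specification; check the package documentation for your version.
# A sketch of excluding certain combinations of controls (drat and qsec chosen arbitrarily)
sc_excl = specy.SpecificationCurve(
    mpg,
    y_endog="mpg",
    x_exog=["lnhp", "hp_boxcox"],
    controls=["drat", "qsec", "cyl", "gear"],
    cat_expand="cyl",
    exclu_grps=[["drat", "qsec"]],
)
sc_excl.fit()
sc_excl.plot()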