# Regression diagnostics and visualisations#

## Introduction#

In this chapter, you’ll see some diagnostics and visualisations that are closely tied to regressions.

Most of this chapter will rely on the amazing statsmodels package, including for some of the examples. Some of the material in this chapter follows Grant McDermott’s excellent notes and the Library of Statistical Translation.

## Imports#

Let’s import some of the packages we’ll be using:

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from lets_plot import *
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as st

LetsPlot.setup_html()

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng=probabilistic random number generator


## Regression plots#

statsmodels has a number of built-in plotting methods to help you understand how well your regression is capturing the relationships you’re looking for. Let’s see a few examples of these using statsmodels built-in Statewide Crime data set:

crime_data = sm.datasets.statecrime.load_pandas()
print(sm.datasets.statecrime.NOTE)

::

Number of observations: 51
Number of variables: 8
Variable name definitions:

state
All 50 states plus DC.
violent
Rate of violent crimes / 100,000 population. Includes murder, forcible
rape, robbery, and aggravated assault. Numbers for Illinois and
Minnesota do not include forcible rapes. Footnote included with the
"The data collection methodology for the offense of forcible
rape used by the Illinois and the Minnesota state Uniform Crime
Reporting (UCR) Programs (with the exception of Rockford, Illinois,
and Minneapolis and St. Paul, Minnesota) does not comply with
national UCR guidelines. Consequently, their state figures for
forcible rape and violent crime (of which forcible rape is a part)
are not published in this table."
murder
Rate of murders / 100,000 population.
Percent of population having graduated from high school or higher.
poverty
% of individuals below the poverty line
white
Percent of population that is one race - white only. From 2009 American
Community Survey
single
Calculated from 2009 1-year American Community Survey obtained obtained
from Census. Variable is Male householder, no wife present, family
household combined with Female householder, no husband present, family
household, divided by the total number of Family households.
urban
% of population in Urbanized Areas as of 2010 Census. Urbanized
Areas are area of 50,000 or more people.


First, let’s look at a Q-Q plot to get a sense of how the variables are distributed. This uses scipy’s stats module. The default distribution is normal but you can use any that scipy supports.

st.probplot(crime_data.data["murder"], dist="norm", plot=plt);


Clearly, this is not quite normal and there are some serious outliers in the tails.

Let’s run take a look at the unconditional relationship we’re interested in: how murder depends on high school graduation. We’ll use lets-plot’s geom_smooth() to do this but bear in mind it will only run a linear model of 'murder ~ hs_grad' and ignore the other covariates.

p = (
+ geom_point()
+ geom_smooth(method="lm")
)
p.show()


We can take into account those other factors by using a partial regression plot that asks what does $$\mathbb{E}(y|X)$$ look like as a function of $$\mathbb{E}(x_i|X)$$? (Use obs_labels=False to remove data point labels.)

with plt.rc_context({"font.size": 5}):
sm.graphics.plot_partregress(
endog="murder",
exog_others=["urban", "poverty", "single"],
data=crime_data.data,
obs_labels=True,
)
plt.show()


At this point, the results of the regression are useful context.

results_crime = smf.ols(
"murder ~ hs_grad + urban + poverty + single", data=crime_data.data
).fit()
print(results_crime.summary())

                            OLS Regression Results
==============================================================================
Dep. Variable:                 murder   R-squared:                       0.813
Method:                 Least Squares   F-statistic:                     50.08
Date:                Fri, 05 Jan 2024   Prob (F-statistic):           3.42e-16
Time:                        15:38:48   Log-Likelihood:                -95.050
No. Observations:                  51   AIC:                             200.1
Df Residuals:                      46   BIC:                             209.8
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -44.1024     12.086     -3.649      0.001     -68.430     -19.774
hs_grad        0.3059      0.117      2.611      0.012       0.070       0.542
urban          0.0109      0.015      0.707      0.483      -0.020       0.042
poverty        0.4121      0.140      2.939      0.005       0.130       0.694
single         0.6374      0.070      9.065      0.000       0.496       0.779
==============================================================================
Omnibus:                        1.618   Durbin-Watson:                   2.507
Prob(Omnibus):                  0.445   Jarque-Bera (JB):                0.831
Skew:                          -0.220   Prob(JB):                        0.660
Kurtosis:                       3.445   Cond. No.                     5.80e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.


Putting the multicollinearity problems to one side, we see that the relationship shown in the partial regression plot is also implied by the coefficient on hs_grad in the regression table.

We can also look at an in-depth summary of one exogenous regressor and its relationship to the outcome variable. Each of these types of regression diagnostic are available individually, or for all regressors at once, too. The first panel is the chart we did with lets-plot rendered differently (and, one could argue, more informatively). Most of the plots below are self-explanatory except for the third one, the CCPR (Component-Component plus Residual) plot. This provides a way to judge the effect of one regressor on the response variable by taking into account the effects of the other independent variables.

fig = plt.figure(figsize=(8, 6), dpi=150)

plt.show()


statsmodels can also produce influence plots of the ‘externally studentised’ residuals vs. the leverage of each observation as measured by the so-called hat matrix $$X(X^{\;\prime}X)^{-1}X^{\;\prime}$$ (because it puts the ‘hat’ on $$y$$). Externally studentised residuals are residuals that are scaled by their standard deviation. High leverage points could exert an undue influence over the regression line, but only if the predicted $$y$$ values of a regression that was fit with them excluded was quite different. In the example below, DC is having a big influence.

with plt.rc_context({"font.size": 6}):
sm.graphics.influence_plot(results_crime)


Finally, it’s nice to be able to see plots of our coefficients along with their standard errors. There isn’t a built-in statsmodels option for this, but happily it’s easy to extract the results of regressions in a sensible format. Using the results object from earlier, and excluding the intercept, we can get the coefficients from results.params[1:] and the associated errors from results.bse[1:].

# Put the results into a dataframe with Name, Coefficient, Error
res_df = (
pd.concat([results_crime.params[1:], results_crime.bse[1:]], axis=1)
.reset_index()
.rename(columns={"index": "Name", 0: "Coefficient", 1: "Error"})
)
res_df["ymax"] = res_df["Coefficient"] + res_df["Error"]
res_df["ymin"] = res_df["Coefficient"] - res_df["Error"]
# Plot the coefficient values and their errors
(
ggplot(res_df)
+ geom_point(aes("Name", "Coefficient"))
+ geom_errorbar(aes(x="Name", ymin="ymin", ymax="ymax"))
).show()


## Binscatter#

We’re going to follow the excellent example in this blog post by Matteo Courthoud.

First we’ll generate some fake data:

def dgp_marketplace(N=10_000):
# Does the firm sells only online?
online = prng.binomial(1, 0.5, N)
# How many products does the firm have
products = 1 + prng.poisson(1, N)
# What is the age of the firm
t = prng.exponential(0.5*products, N)
# Sales
sales = 1e3 * prng.exponential(products + np.maximum((1 + 0.3*products + 4*online)*t - 0.5*(1 + 6*online)*t**2, 0), N)
# Generate the dataframe
df = pd.DataFrame({'age': t, 'sales': sales, 'online': online, 'products': products})
return df

sales = dgp_marketplace(N=10_000)

age sales online products
0 0.215330 1476.840811 0 2
1 0.306930 1624.207274 1 1
2 0.791730 6721.358263 1 3
3 0.157610 5520.048141 0 1
4 1.316726 1916.905960 0 3

We have pulled down information on 10,000 firms. For each firm we know:

• age: the age of the firm

• sales: the monthly sales from last month

• online: whether the firm is only active online

• products: the number of products that the firm offers

Suppose we are interested in understanding the relationship between age and sales. What is the life-cycle of sales?

(
ggplot(sales, aes(x="age", y="sales")) +
geom_point(color="blue", alpha=0.4)
).show()


This plot is far too crowded to get a real sense of the relationship here.

A linear regression can help us unpick what the relationship really is

smf.ols('np.log(sales) ~ np.log(age)', sales).fit().summary().tables[1]

coef std err t P>|t| [0.025 0.975] 7.4012 0.015 487.440 0.000 7.371 7.431 0.1531 0.010 15.777 0.000 0.134 0.172

There is a pretty strong positive relationship here that wasn’t evident in the scatter plot. BUT it’s possible that this relationships depends on whether the firms are online-only or not. We can condition on the other variables and do another regression:

smf.ols('np.log(sales) ~ np.log(age) + C(online) + products', sales).fit().summary().tables[1]

coef std err t P>|t| [0.025 0.975] 6.5121 0.036 178.775 0.000 6.441 6.584 0.1877 0.026 7.186 0.000 0.137 0.239 0.0645 0.010 6.477 0.000 0.045 0.084 0.3665 0.014 26.256 0.000 0.339 0.394

The coefficient now looks very different having conditioned on the other confounders. We can see that sales still increase with age, but we don’t know if this relationship is linear or not. We could add extra terms (like age squared) to explore this.

Or, we could use a binned scatterplot to analyse the relationship non-parametrically. The binned scatterplot divides the conditioning variable, age in our example, into equally sized bins or quantiles, and then plots the conditional mean of the dependent variable, sales in our example, within each bin.

We’ll be using the binsreg package for this Cattaneo et al. [2019].

Binned scatterplots do not just provide conditional means for optimally chosen intervals; they can also provide inference for these means. In particular, they will give confidence intervals around each data point. In the binsreg package, the ci option adds confidence intervals to the estimation results.

As the binsreg package is not written in the most Pythonic way, we’ll adorn it a bit to make it more easily used with dataframes.

import binsreg

def binscatter(**kwargs):
# Estimate binsreg
est = binsreg.binsreg(**kwargs, noplot=True)
# Retrieve estimates
df_est = pd.concat([d.dots for d in est.data_plot])
df_est = df_est.rename(columns={'x': kwargs.get("x"), 'fit': kwargs.get("y")})
if "ci" in kwargs:
df_est = pd.merge(df_est, pd.concat([d.ci for d in est.data_plot]))
df_est = df_est.drop(columns=['x'])
df_est['ci'] = df_est['ci_r'] - df_est['ci_l']
# Rename groups
if "by" in kwargs:
df_est['group'] = df_est['group'].astype(kwargs.get("data")[kwargs.get("by")].dtype)
df_est = df_est.rename(columns={'group': kwargs.get("by")})

return df_est

# Estimate binsreg
br_est = binscatter(x='age', y='sales', data=sales, ci=(3,3))

/Users/aet/mambaforge/envs/codeforecon/lib/python3.10/site-packages/binsreg/binsreg.py:783: UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.

group age bin isknot mid sales ci_l ci_r ci
0 Full Sample 0.014120 0 0 0 1562.374625 1245.889702 1689.715525 443.825823
1 Full Sample 0.044370 1 0 0 1740.650743 1576.219312 1960.659363 384.440051
2 Full Sample 0.079516 2 0 0 1792.476949 1576.668323 1952.172833 375.504510
3 Full Sample 0.116433 3 0 0 2064.777874 1916.875465 2413.991609 497.116144
4 Full Sample 0.153176 4 0 0 2428.623364 2080.920773 2569.592233 488.671459

We can now plot the results

(
ggplot(br_est, aes(x="age", y="sales", ymin='ci_l', ymax="ci_r")) +
geom_point() +
geom_errorbar() +
ggtitle("Binscatter: sales as a function of age")
)


We can now see that the relationship is non-linear with a sharp increase in sales at the beginning of the lifetime of a firm, followed by a plateau.

As noted, this relationship may be modified by other variables though. So let’s now condition on those.

To condition on products, we pass w=["products"] to the binscatter() function.

# Estimate binsreg
br_est = binscatter(x='age', y='sales', w=['products'], data=sales, ci=(3,3))

/Users/aet/mambaforge/envs/codeforecon/lib/python3.10/site-packages/binsreg/binsreg.py:783: UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.

group age bin isknot mid sales ci_l ci_r ci
0 Full Sample 0.015318 0 0 0 2076.023370 1855.541497 2288.822512 433.281014
1 Full Sample 0.048655 1 0 0 2224.250215 2031.657039 2405.704773 374.047734
2 Full Sample 0.086865 2 0 0 2348.099214 2151.597400 2521.641938 370.044537
3 Full Sample 0.127547 3 0 0 2641.615108 2415.134361 2873.678452 458.544092
4 Full Sample 0.166449 4 0 0 2595.750136 2513.130901 2964.502585 451.371684

Now let’s look at the plot of this:

(
ggplot(br_est, aes(x="age", y="sales", ymin='ci_l', ymax="ci_r")) +
geom_point() +
geom_errorbar() +
ggtitle("Binscatter: sales as a function of age conditioned on products")
)


Conditional on the number of products, the shape of the sales life-cycle changes further. Now, after an initial increase in sales, we observe a gradual decrease over time.

Do online-only firms have different sales life-cycles with respect to mixed online-offline firms? We can produce different binned scatterplots by group using the option by.

# Estimate binsreg
br_est = binscatter(x='age', y='sales', w=['products'], by="online", data=sales, ci=(3,3))
br_est["online"] = br_est["online"].astype("boolean")

/Users/aet/mambaforge/envs/codeforecon/lib/python3.10/site-packages/binsreg/binsreg.py:940: UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.

online age bin isknot mid sales ci_l ci_r ci
0 False 0.017810 0 0 0 2148.012043 1774.809400 2365.909544 591.100144
1 False 0.056041 1 0 0 2144.905095 1919.921103 2380.367142 460.446039
2 False 0.097980 2 0 0 2206.951051 2028.323642 2500.604182 472.280540
3 False 0.146331 3 0 0 2529.481251 2222.970487 2717.260673 494.290186
4 False 0.191721 4 0 0 2459.073079 2140.625594 2636.301404 495.675810
(
ggplot(br_est, aes(x="age", y="sales", color="online", ymin='ci_l', ymax="ci_r")) +
geom_point() +
geom_errorbar() +
ggtitle("Binscatter: sales as a function of age conditioned on products")
)


From the binned scatterplot, we can see that online products have on average shorter lifetimes, with a higher initial peak in sales, followed by a sharper decline.

Hopefully this shows you how you can use binsreg to get a better understanding of causal relationships in your data.

You can find out more about bin scatters in this video by Paul Goldsmith-Pinkham.

## Specification curve analysis#

When specifying a model, modellers have many options. These can be informed by field intelligence, priors, and even misguided attempts to find a significant result. Even with the best of intentions, research teams can reach entirely different conclusions using the same, or similar, data because of different choices made in preparing data or in modelling it.

There’s formal evidence that researchers really do make different decisions; one study gave the same research question - whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players - to 29 different teams. From the abstract of that paper:

Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability.

So not only were different decisions made, there seems to be no clearly identifiable reason for them. There is usually scope for reasonable alternative model specifications when estimating coefficients, and those coefficients will vary with those specifications.

Specification curve analysis looks for a more exhaustive way of trying out alternative specifications. The three steps of specification curve analysis are:

1. identifying the set of theoretically justified, statistically valid, and non-redundant analytic specifications;

2. displaying alternative results graphically, allowing the identification of decisions producing different results; and

3. conducting statistical tests to determine whether as a whole results are inconsistent with the null hypothesis.

For a good example of specification curve analysis in action, see this recent Nature Human Behaviour paper on the association between adolescent well-being and the use of digital technology.

We’ll use the specification curve analysis package to do the first two, which you can install with pip install specification_curve. To demonstrate the full functionality, we’ll use the “mpg” dataset and create a second, alternative ‘hp’ that is a transformed version of the original.

mpg = pd.read_csv(
"https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv",
dtype={"model": str, "mpg": float, "hp": float, "disp": float, "cyl": "category"},
)

model mpg cyl disp hp drat wt qsec vs am gear carb
0 Mazda RX4 21.0 6 160.0 110.0 3.90 2.620 16.46 0 1 4 4
1 Mazda RX4 Wag 21.0 6 160.0 110.0 3.90 2.875 17.02 0 1 4 4
2 Datsun 710 22.8 4 108.0 93.0 3.85 2.320 18.61 1 1 4 1
3 Hornet 4 Drive 21.4 6 258.0 110.0 3.08 3.215 19.44 1 0 3 1
4 Hornet Sportabout 18.7 8 360.0 175.0 3.15 3.440 17.02 0 0 3 2
mpg["hp_boxcox"], _ = st.boxcox(mpg["hp"])
mpg["lnhp"] = np.log(mpg["hp"])

mpg.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 14 columns):
#   Column     Non-Null Count  Dtype
---  ------     --------------  -----
0   model      32 non-null     object
1   mpg        32 non-null     float64
2   cyl        32 non-null     category
3   disp       32 non-null     float64
4   hp         32 non-null     float64
5   drat       32 non-null     float64
6   wt         32 non-null     float64
7   qsec       32 non-null     float64
8   vs         32 non-null     int64
9   am         32 non-null     int64
10  gear       32 non-null     int64
11  carb       32 non-null     int64
12  hp_boxcox  32 non-null     float64
13  lnhp       32 non-null     float64
dtypes: category(1), float64(8), int64(4), object(1)
memory usage: 3.5+ KB


Now let’s create a specification curve. We need to specify the data, the different outcome variables we’d like to try, y_endog; the different possible versions of the main regressor of interest, x_exog; the possible controls, controls; any controls that should always be included, always_include; and any categorical variables to include class-by-class, cat_expand. Some of these accept lists of variables as well as single reggressors. The point estimates that have confidence intervals which include zero are coloured in grey, instead of blue. There is also an exclu_grps option to exclude certain combinations of regressors, and you can pass alternative estimators to fit, for example fit(estimator=sm.Logit).

import specification_curve as specy

sc = specy.SpecificationCurve(
mpg,
y_endog="mpg",
x_exog=["lnhp", "hp_boxcox"],
controls=["drat", "qsec", "cyl", "gear"],
always_include=["gear"],
cat_expand="cyl",
)
sc.fit()
sc.plot()

Fit complete