Regression diagnostics and visualisations#
Introduction#
In this chapter, you’ll see some diagnostics and visualisations that are closely tied to regressions.
Most of this chapter will rely on the amazing statsmodels package, including for some of the examples. Some of the material in this chapter follows Grant McDermott’s excellent notes and the Library of Statistical Translation.
Imports#
Let’s import some of the packages we’ll be using:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy.stats as st
import statsmodels.api as sm
import statsmodels.formula.api as smf
from lets_plot import *
LetsPlot.setup_html()
# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(
seed_for_prng
) # prng=probabilistic random number generator
Regression plots#
statsmodels has a number of built-in plotting methods to help you understand how well your regression is capturing the relationships you’re looking for. Let’s see a few examples of these using statsmodels built-in Statewide Crime data set:
crime_data = sm.datasets.statecrime.load_pandas()
print(sm.datasets.statecrime.NOTE)
::
Number of observations: 51
Number of variables: 8
Variable name definitions:
state
All 50 states plus DC.
violent
Rate of violent crimes / 100,000 population. Includes murder, forcible
rape, robbery, and aggravated assault. Numbers for Illinois and
Minnesota do not include forcible rapes. Footnote included with the
American Statistical Abstract table reads:
"The data collection methodology for the offense of forcible
rape used by the Illinois and the Minnesota state Uniform Crime
Reporting (UCR) Programs (with the exception of Rockford, Illinois,
and Minneapolis and St. Paul, Minnesota) does not comply with
national UCR guidelines. Consequently, their state figures for
forcible rape and violent crime (of which forcible rape is a part)
are not published in this table."
murder
Rate of murders / 100,000 population.
hs_grad
Percent of population having graduated from high school or higher.
poverty
% of individuals below the poverty line
white
Percent of population that is one race - white only. From 2009 American
Community Survey
single
Calculated from 2009 1-year American Community Survey obtained obtained
from Census. Variable is Male householder, no wife present, family
household combined with Female householder, no husband present, family
household, divided by the total number of Family households.
urban
% of population in Urbanized Areas as of 2010 Census. Urbanized
Areas are area of 50,000 or more people.
First, let’s look at a Q-Q plot to get a sense of how the variables are distributed. This uses scipy’s stats module. The default distribution is normal but you can use any that scipy supports.
st.probplot(crime_data.data["murder"], dist="norm", plot=plt);
Clearly, this is not quite normal and there are some serious outliers in the tails.
Let’s run take a look at the unconditional relationship we’re interested in: how murder depends on high school graduation. We’ll use lets-plot’s geom_smooth()
to do this but bear in mind it will only run a linear model of 'murder ~ hs_grad'
and ignore the other covariates.
p = (
ggplot(crime_data.data, aes(y="murder", x="hs_grad"))
+ geom_point()
+ geom_smooth(method="lm")
)
p.show()
We can take into account those other factors by using a partial regression plot that asks what does \(\mathbb{E}(y|X)\) look like as a function of \(\mathbb{E}(x_i|X)\)? (Use obs_labels=False
to remove data point labels.)
with plt.rc_context({"font.size": 5}):
sm.graphics.plot_partregress(
endog="murder",
exog_i="hs_grad",
exog_others=["urban", "poverty", "single"],
data=crime_data.data,
obs_labels=True,
)
plt.show()
At this point, the results of the regression are useful context.
results_crime = smf.ols(
"murder ~ hs_grad + urban + poverty + single", data=crime_data.data
).fit()
print(results_crime.summary())
OLS Regression Results
==============================================================================
Dep. Variable: murder R-squared: 0.813
Model: OLS Adj. R-squared: 0.797
Method: Least Squares F-statistic: 50.08
Date: Fri, 03 Jan 2025 Prob (F-statistic): 3.42e-16
Time: 01:02:23 Log-Likelihood: -95.050
No. Observations: 51 AIC: 200.1
Df Residuals: 46 BIC: 209.8
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept -44.1024 12.086 -3.649 0.001 -68.430 -19.774
hs_grad 0.3059 0.117 2.611 0.012 0.070 0.542
urban 0.0109 0.015 0.707 0.483 -0.020 0.042
poverty 0.4121 0.140 2.939 0.005 0.130 0.694
single 0.6374 0.070 9.065 0.000 0.496 0.779
==============================================================================
Omnibus: 1.618 Durbin-Watson: 2.507
Prob(Omnibus): 0.445 Jarque-Bera (JB): 0.831
Skew: -0.220 Prob(JB): 0.660
Kurtosis: 3.445 Cond. No. 5.80e+03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.8e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Putting the multicollinearity problems to one side, we see that the relationship shown in the partial regression plot is also implied by the coefficient on hs_grad
in the regression table.
We can also look at an in-depth summary of one exogenous regressor and its relationship to the outcome variable. Each of these types of regression diagnostic are available individually, or for all regressors at once, too. The first panel is the chart we did with lets-plot rendered differently (and, one could argue, more informatively). Most of the plots below are self-explanatory except for the third one, the CCPR (Component-Component plus Residual) plot. This provides a way to judge the effect of one regressor on the response variable by taking into account the effects of the other independent variables.
fig = plt.figure(figsize=(8, 6), dpi=150)
sm.graphics.plot_regress_exog(results_crime, "hs_grad", fig=fig)
plt.show()
statsmodels can also produce influence plots of the ‘externally studentised’ residuals vs. the leverage of each observation as measured by the so-called hat matrix \(X(X^{\;\prime}X)^{-1}X^{\;\prime}\) (because it puts the ‘hat’ on \(y\)). Externally studentised residuals are residuals that are scaled by their standard deviation. High leverage points could exert an undue influence over the regression line, but only if the predicted \(y\) values of a regression that was fit with them excluded was quite different. In the example below, DC is having a big influence.
with plt.rc_context({"font.size": 6}):
sm.graphics.influence_plot(results_crime)
Finally, it’s nice to be able to see plots of our coefficients along with their standard errors. There isn’t a built-in statsmodels option for this, but happily it’s easy to extract the results of regressions in a sensible format. Using the results
object from earlier, and excluding the intercept, we can get the coefficients from results.params[1:]
and the associated errors from results.bse[1:]
.
# Put the results into a dataframe with Name, Coefficient, Error
res_df = (
pd.concat([results_crime.params[1:], results_crime.bse[1:]], axis=1)
.reset_index()
.rename(columns={"index": "Name", 0: "Coefficient", 1: "Error"})
)
res_df["ymax"] = res_df["Coefficient"] + res_df["Error"]
res_df["ymin"] = res_df["Coefficient"] - res_df["Error"]
# Plot the coefficient values and their errors
(
ggplot(res_df)
+ geom_point(aes("Name", "Coefficient"))
+ geom_errorbar(aes(x="Name", ymin="ymin", ymax="ymax"))
).show()
Binscatter#
We’re going to follow the excellent example in this blog post by Matteo Courthoud.
First we’ll generate some fake data:
def dgp_marketplace(N=10_000):
# Does the firm sells only online?
online = prng.binomial(1, 0.5, N)
# How many products does the firm have
products = 1 + prng.poisson(1, N)
# What is the age of the firm
t = prng.exponential(0.5 * products, N)
# Sales
sales = 1e3 * prng.exponential(
products
+ np.maximum(
(1 + 0.3 * products + 4 * online) * t - 0.5 * (1 + 6 * online) * t**2, 0
),
N,
)
# Generate the dataframe
df = pd.DataFrame(
{"age": t, "sales": sales, "online": online, "products": products}
)
return df
sales = dgp_marketplace(N=10_000)
sales.head()
age | sales | online | products | |
---|---|---|---|---|
0 | 0.215330 | 1476.840811 | 0 | 2 |
1 | 0.306930 | 1624.207274 | 1 | 1 |
2 | 0.791730 | 6721.358263 | 1 | 3 |
3 | 0.157610 | 5520.048141 | 0 | 1 |
4 | 1.316726 | 1916.905960 | 0 | 3 |
We have pulled down information on 10,000 firms. For each firm we know:
age
: the age of the firmsales
: the monthly sales from last monthonline
: whether the firm is only active onlineproducts
: the number of products that the firm offers
Suppose we are interested in understanding the relationship between age and sales. What is the life-cycle of sales?
Let’s start with a simple scatterplot of sales vs age.
(ggplot(sales, aes(x="age", y="sales")) + geom_point(color="blue", alpha=0.4)).show()
This plot is far too crowded to get a real sense of the relationship here.
A linear regression can help us unpick what the relationship really is
smf.ols("np.log(sales) ~ np.log(age)", sales).fit().summary().tables[1]
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 7.4012 | 0.015 | 487.440 | 0.000 | 7.371 | 7.431 |
np.log(age) | 0.1531 | 0.010 | 15.777 | 0.000 | 0.134 | 0.172 |
There is a pretty strong positive relationship here that wasn’t evident in the scatter plot. BUT it’s possible that this relationships depends on whether the firms are online-only or not. We can condition on the other variables and do another regression:
smf.ols(
"np.log(sales) ~ np.log(age) + C(online) + products", sales
).fit().summary().tables[1]
coef | std err | t | P>|t| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
Intercept | 6.5121 | 0.036 | 178.775 | 0.000 | 6.441 | 6.584 |
C(online)[T.1] | 0.1877 | 0.026 | 7.186 | 0.000 | 0.137 | 0.239 |
np.log(age) | 0.0645 | 0.010 | 6.477 | 0.000 | 0.045 | 0.084 |
products | 0.3665 | 0.014 | 26.256 | 0.000 | 0.339 | 0.394 |
The coefficient now looks very different having conditioned on the other confounders. We can see that sales still increase with age, but we don’t know if this relationship is linear or not. We could add extra terms (like age squared) to explore this.
Or, we could use a binned scatterplot to analyse the relationship non-parametrically. The binned scatterplot divides the conditioning variable, age
in our example, into equally sized bins or quantiles, and then plots the conditional mean of the dependent variable, sales
in our example, within each bin.
We’ll be using the binsreg package for this Cattaneo et al. [2019].
Binned scatterplots do not just provide conditional means for optimally chosen intervals; they can also provide inference for these means. In particular, they will give confidence intervals around each data point. In the binsreg package, the ci
option adds confidence intervals to the estimation results.
As the binsreg package is not written in the most Pythonic way, we’ll adorn it a bit to make it more easily used with dataframes.
import binsreg
def binscatter(**kwargs):
# Estimate binsreg
est = binsreg.binsreg(**kwargs, noplot=True)
# Retrieve estimates
df_est = pd.concat([d.dots for d in est.data_plot])
df_est = df_est.rename(columns={"x": kwargs.get("x"), "fit": kwargs.get("y")})
# Add confidence intervals
if "ci" in kwargs:
df_est = pd.merge(df_est, pd.concat([d.ci for d in est.data_plot]))
df_est = df_est.drop(columns=["x"])
df_est["ci"] = df_est["ci_r"] - df_est["ci_l"]
# Rename groups
if "by" in kwargs:
df_est["group"] = df_est["group"].astype(
kwargs.get("data")[kwargs.get("by")].dtype
)
df_est = df_est.rename(columns={"group": kwargs.get("by")})
return df_est
# Estimate binsreg
br_est = binscatter(x="age", y="sales", data=sales, ci=(3, 3))
br_est.head()
/home/runner/micromamba/envs/codeforecon/lib/python3.10/site-packages/binsreg/binsreg.py:784: UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.
group | age | bin | isknot | mid | sales | ci_l | ci_r | ci | |
---|---|---|---|---|---|---|---|---|---|
0 | Full Sample | 0.014120 | 0 | 0 | 0 | 1562.374625 | 1245.889702 | 1689.715525 | 443.825823 |
1 | Full Sample | 0.044370 | 1 | 0 | 0 | 1740.650743 | 1576.219312 | 1960.659363 | 384.440051 |
2 | Full Sample | 0.079516 | 2 | 0 | 0 | 1792.476949 | 1576.668323 | 1952.172833 | 375.504510 |
3 | Full Sample | 0.116433 | 3 | 0 | 0 | 2064.777874 | 1916.875465 | 2413.991609 | 497.116144 |
4 | Full Sample | 0.153176 | 4 | 0 | 0 | 2428.623364 | 2080.920773 | 2569.592233 | 488.671459 |
We can now plot the results
(
ggplot(br_est, aes(x="age", y="sales", ymin="ci_l", ymax="ci_r"))
+ geom_point()
+ geom_errorbar()
+ ggtitle("Binscatter: sales as a function of age")
)
We can now see that the relationship is non-linear with a sharp increase in sales
at the beginning of the lifetime of a firm, followed by a plateau.
As noted, this relationship may be modified by other variables though. So let’s now condition on those.
To condition on products
, we pass w=["products"]
to the binscatter()
function.
# Estimate binsreg
br_est = binscatter(x="age", y="sales", w=["products"], data=sales, ci=(3, 3))
br_est.head()
/home/runner/micromamba/envs/codeforecon/lib/python3.10/site-packages/binsreg/binsreg.py:784: UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.
group | age | bin | isknot | mid | sales | ci_l | ci_r | ci | |
---|---|---|---|---|---|---|---|---|---|
0 | Full Sample | 0.013602 | 0 | 0 | 0 | 2087.953373 | 1831.700489 | 2297.705257 | 466.004767 |
1 | Full Sample | 0.042516 | 1 | 0 | 0 | 2167.297481 | 2007.649389 | 2389.329550 | 381.680160 |
2 | Full Sample | 0.076085 | 2 | 0 | 0 | 2279.090462 | 2104.168484 | 2476.730841 | 372.562358 |
3 | Full Sample | 0.111670 | 3 | 0 | 0 | 2437.261875 | 2284.513572 | 2745.773708 | 461.260136 |
4 | Full Sample | 0.147252 | 4 | 0 | 0 | 2849.811871 | 2501.530277 | 2986.581754 | 485.051477 |
Now let’s look at the plot of this:
(
ggplot(br_est, aes(x="age", y="sales", ymin="ci_l", ymax="ci_r"))
+ geom_point()
+ geom_errorbar()
+ ggtitle("Binscatter: sales as a function of age conditioned on products")
)
Conditional on the number of products
, the shape of the sales
life-cycle changes further. Now, after an initial increase in sales
, we observe a gradual decrease over time.
Do online-only firms have different sales
life-cycles with respect to mixed online-offline firms? We can produce different binned scatterplots by group using the option by
.
# Estimate binsreg
br_est = binscatter(
x="age", y="sales", w=["products"], by="online", data=sales, ci=(3, 3)
)
br_est["online"] = br_est["online"].astype("boolean")
br_est.head()
/home/runner/micromamba/envs/codeforecon/lib/python3.10/site-packages/binsreg/binsreg.py:941: UserWarning: To speed up computation, bin/degree selection uses a subsample of roughly max(5,000, 0.01n) observations if the sample size n>5,000. To use the full sample, set randcut=1.
online | age | bin | isknot | mid | sales | ci_l | ci_r | ci | |
---|---|---|---|---|---|---|---|---|---|
0 | False | 0.017810 | 0 | 0 | 0 | 2148.012043 | 1774.809400 | 2365.909544 | 591.100144 |
1 | False | 0.056041 | 1 | 0 | 0 | 2144.905095 | 1919.921103 | 2380.367142 | 460.446039 |
2 | False | 0.097980 | 2 | 0 | 0 | 2206.951051 | 2028.323642 | 2500.604182 | 472.280540 |
3 | False | 0.146331 | 3 | 0 | 0 | 2529.481251 | 2222.970487 | 2717.260673 | 494.290186 |
4 | False | 0.191721 | 4 | 0 | 0 | 2459.073079 | 2140.625594 | 2636.301404 | 495.675810 |
(
ggplot(br_est, aes(x="age", y="sales", color="online", ymin="ci_l", ymax="ci_r"))
+ geom_point()
+ geom_errorbar()
+ ggtitle("Binscatter: sales as a function of age conditioned on products")
)
From the binned scatterplot, we can see that online products have on average shorter lifetimes, with a higher initial peak in sales, followed by a sharper decline.
Hopefully this shows you how you can use binsreg to get a better understanding of causal relationships in your data.
You can find out more about bin scatters in this video by Paul Goldsmith-Pinkham.
Specification curve analysis#
When specifying a model, modellers have many options. These can be informed by field intelligence, priors, and even misguided attempts to find a significant result. Even with the best of intentions, research teams can reach entirely different conclusions using the same, or similar, data because of different choices made in preparing data or in modelling it.
There’s formal evidence that researchers really do make different decisions; one study [Silberzahn, Uhlmann, Martin, Anselmi, Aust, Awtrey, Bahník, Bai, Bannard, Bonnier, and others, 2018] gave the same research question - whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players - to 29 different teams. From the abstract of that paper:
Analytic approaches varied widely across the teams, and the estimated effect sizes ranged from 0.89 to 2.93 (Mdn = 1.31) in odds-ratio units. Twenty teams (69%) found a statistically significant positive effect, and 9 teams (31%) did not observe a significant relationship. Overall, the 29 different analyses used 21 unique combinations of covariates. Neither analysts’ prior beliefs about the effect of interest nor their level of expertise readily explained the variation in the outcomes of the analyses. Peer ratings of the quality of the analyses also did not account for the variability.
So not only were different decisions made, there seems to be no clearly identifiable reason for them. There is usually scope for reasonable alternative model specifications when estimating coefficients, and those coefficients will vary with those specifications.
Specification curve analysis [Simonsohn, Simmons, and Nelson, 2020] looks for a more exhaustive way of trying out alternative specifications. The three steps of specification curve analysis are:
identifying the set of theoretically justified, statistically valid, and non-redundant analytic specifications;
displaying alternative results graphically, allowing the identification of decisions producing different results; and
conducting statistical tests to determine whether as a whole results are inconsistent with the null hypothesis.
For a good example of specification curve analysis in action, see this recent Nature Human Behaviour paper [Orben and Przybylski, 2019] on the association between adolescent well-being and the use of digital technology.
We’ll use the specification curve analysis package to do the first two, which you can install with pip install specification_curve
. To demonstrate the full functionality, we’ll use the “mpg” dataset and create a second, alternative ‘hp’ that is a transformed version of the original.
mpg = pd.read_csv(
"https://raw.githubusercontent.com/LOST-STATS/lost-stats.github.io/source/Data/mtcars.csv",
dtype={"model": str, "mpg": float, "hp": float, "disp": float, "cyl": "category"},
)
mpg.head()
model | mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Mazda RX4 | 21.0 | 6 | 160.0 | 110.0 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
1 | Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110.0 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
2 | Datsun 710 | 22.8 | 4 | 108.0 | 93.0 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
3 | Hornet 4 Drive | 21.4 | 6 | 258.0 | 110.0 | 3.08 | 3.215 | 19.44 | 1 | 0 | 3 | 1 |
4 | Hornet Sportabout | 18.7 | 8 | 360.0 | 175.0 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
mpg["hp_boxcox"], _ = st.boxcox(mpg["hp"])
mpg["lnhp"] = np.log(mpg["hp"])
mpg.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 model 32 non-null object
1 mpg 32 non-null float64
2 cyl 32 non-null category
3 disp 32 non-null float64
4 hp 32 non-null float64
5 drat 32 non-null float64
6 wt 32 non-null float64
7 qsec 32 non-null float64
8 vs 32 non-null int64
9 am 32 non-null int64
10 gear 32 non-null int64
11 carb 32 non-null int64
12 hp_boxcox 32 non-null float64
13 lnhp 32 non-null float64
dtypes: category(1), float64(8), int64(4), object(1)
memory usage: 3.5+ KB
Now let’s create a specification curve. We need to specify the data, the different outcome variables we’d like to try, y_endog
; the different possible versions of the main regressor of interest, x_exog
; the possible controls, controls
; any controls that should always be included, always_include
; and any categorical variables to include class-by-class, cat_expand
. Some of these accept lists of variables as well as single reggressors. The point estimates that have confidence intervals which include zero are coloured in grey, instead of blue. There is also an exclu_grps
option to exclude certain combinations of regressors, and you can pass alternative estimators to fit, for example fit(estimator=sm.Logit)
.
import specification_curve as specy
sc = specy.SpecificationCurve(
mpg,
y_endog="mpg",
x_exog=["lnhp", "hp_boxcox"],
controls=["drat", "qsec", "cyl", "gear"],
always_include=["gear"],
cat_expand="cyl",
)
sc.fit()
sc.plot()
Fit complete