```python
import specification_curve as sc

df = sc.load_example_data1()
y_endog = "y1"  # endogenous variable
x_exog = "x1"  # exogenous variable
controls = ["c1", "c2", "group1", "group2"]
sco = sc.SpecificationCurve(
    df,
    y_endog,
    x_exog,
    controls,
)
sco.fit()
sco.plot()
```
Features
Let’s see some of Specification Curve’s features in action.
Basic Use
Here’s an example of using Specification Curve. Note that, in the below, we can pass strings or lists of strings into the arguments of the `SpecificationCurve` class. The package then automatically performs all possible regressions of endogenous variables on exogenous variables and controls. The estimate that is picked out is the coefficient on the given combination of endogenous and exogenous variables (conditioning on the given controls).
If a control variable is categorical, rather than continuous, it will be treated as a fixed effect.
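As a minimal sketch of what that means (with hypothetical data; this illustrates the idea rather than the package’s internal code), a categorical column enters the regression as a set of fixed-effect dummies, one per level, rather than as a single slope:

```python
import pandas as pd

# Hypothetical data: "region" is a categorical control
df_fe = pd.DataFrame(
    {
        "y1": [1.0, 2.0, 1.5, 2.5],
        "x1": [0.1, 0.4, 0.2, 0.5],
        "region": ["north", "south", "north", "south"],
    }
)
# Treated as a fixed effect: one dummy per level in the design matrix
print(pd.get_dummies(df_fe["region"], prefix="region"))
```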
Grey squares (black lines when there are many specifications) show whether a variable is included in a specification or not. Blue markers and error bars show coefficients that are positive and significant (at the 0.05 level); red ones show coefficients that are negative and significant.
You can also specify models with the formula API. An example string would be `"y1 | y2 ~ x1 | x2 + c1 + c2 + c3 | c4"`. This would produce a specification curve in which there are two endogenous variables, two exogenous variables, two variables that are always included (c1 and c2), and two controls that are included in all possible combinations.
Here’s an example, for which we first generate some data.
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator

# Generate some fake data
n_samples = 1000
x_2 = prng.integers(2, size=n_samples)
x_1 = prng.random(size=n_samples)
x_3 = prng.integers(3, size=n_samples)
x_4 = prng.random(size=n_samples)
x_5 = x_1 + 0.05 * np.random.randn(n_samples)
x_beta = -1 - 3.5 * x_1 + 0.2 * x_2 + 0.3 * x_3  # NB: coefficient on x_1 is -3.5
prob = 1 / (1 + np.exp(-x_beta))
y = prng.binomial(n=1, p=prob, size=n_samples)
y2 = prng.binomial(n=1, p=prob * 0.98, size=n_samples)
df_logit = pd.DataFrame(
    [x_1, x_2, x_3, x_4, x_5, y, y2],
    ["x_1", "x_2", "x_3", "x_4", "x_5", "y", "y2"],
).T
sc_form = sc.SpecificationCurve(
    df_logit, formula="y | y2 ~ x_1 | x_5 + x_2 + x_3 | x_4"
)
sc_form.fit()
sc_form.plot()
```
Note that “x_2” is included in all specifications.
Retrieving estimates
You can retrieve the estimates from the results data frame, `df_r`:
```python
sco = sc.SpecificationCurve(df, y_endog, x_exog, controls)
sco.fit()
sco.df_r.head()
```
| Specification No. | x_exog | y_endog | Results | Coefficient | Specification | bse | conf_int | pvalues | SpecificationCounts |
|---|---|---|---|---|---|---|---|---|---|
| 0 | x1 | y1 | <statsmodels.regression.linear_model.Regressio... | 6.205962 | [c1, c2, x1, y1] | 0.385317 | [5.448896340566617, 6.963027714134404] | {'const': 1.2704500980836043e-23, 'x1': 3.3571... | {'c1': 1, 'c2': 1, 'x1': 1, 'y1': 1} |
| 1 | x1 | y1 | <statsmodels.regression.linear_model.Regressio... | 6.205962 | [c1, c2, group1, group2, x1, y1] | 0.385317 | [5.448896340566617, 6.963027714134404] | {'const': 1.2704500980836043e-23, 'x1': 3.3571... | {'c1': 1, 'c2': 1, 'group1': 1, 'group2': 1, '... |
| 2 | x1 | y1 | <statsmodels.regression.linear_model.Regressio... | 6.205962 | [c1, c2, group2, x1, y1] | 0.385317 | [5.448896340566617, 6.963027714134404] | {'const': 1.2704500980836043e-23, 'x1': 3.3571... | {'c1': 1, 'c2': 1, 'group2': 1, 'x1': 1, 'y1': 1} |
| 3 | x1 | y1 | <statsmodels.regression.linear_model.Regressio... | 6.205962 | [c1, c2, group1, x1, y1] | 0.385317 | [5.448896340566617, 6.963027714134404] | {'const': 1.2704500980836043e-23, 'x1': 3.3571... | {'c1': 1, 'c2': 1, 'group1': 1, 'x1': 1, 'y1': 1} |
| 4 | x1 | y1 | <statsmodels.regression.linear_model.Regressio... | 6.492386 | [c1, group1, group2, x1, y1] | 0.592879 | [5.3275113394582725, 7.657260911796481] | {'const': 0.7410410238698457, 'x1': 3.92271332... | {'c1': 1, 'group1': 1, 'group2': 1, 'x1': 1, '... |
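Because `df_r` is an ordinary pandas data frame, you can work with the estimates directly. For instance, a quick sketch summarising the coefficient across specifications (column names as in the table above):

```python
# Summarise the coefficient across all fitted specifications
coefs = sco.df_r["Coefficient"]
print(f"median coefficient: {coefs.median():.3f}")
print(f"range: [{coefs.min():.3f}, {coefs.max():.3f}]")
```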
Inference under the null
Although not everyone is convinced, it may be informative to run some statistical inferences on the specification curve. These ask: considering the full set of reasonable specifications jointly, how inconsistent are the results with the null hypothesis of no effect? You can find more details in [@simonsohn2020specification].
```python
dfn = sc.load_example_data3()
dfn.head()
```
| | y1 | y2 | x1 | c1 | c2 | c3 | ccat |
|---|---|---|---|---|---|---|---|
| 0 | 0.408694 | -0.677861 | 0.319844 | 0.873945 | 0.315051 | 0.317730 | 0 |
| 1 | 0.823288 | 0.220807 | 0.879205 | 0.603677 | 0.044192 | 0.313405 | 0 |
| 2 | 1.314995 | 1.100182 | 0.980309 | 0.937964 | 0.790195 | 0.684977 | 1 |
| 3 | 0.481942 | 0.305384 | 0.210092 | 0.718184 | 0.079917 | 0.719236 | 1 |
| 4 | 1.013794 | 0.773417 | 0.267960 | 0.239086 | 0.789400 | 0.279626 | 1 |
As normal, we create a specification curve object and fit it.
```python
sco = sc.SpecificationCurve(
    dfn, y_endog=["y1", "y2"], x_exog="x1", controls=["c1", "c2", "c3"]
)
sco.fit()
```
Now, to look at the fit under the null, we can either fit it directly, using `.fit_null()`, or use `.plot(show_null_curve=True)`, which will run the inference automatically and plot it. In either case, you can also pass the number of bootstraps as an argument: for example, `.fit_null(n_boot=10)` or `.plot(show_null_curve=True, **{"n_boot": 10})`, respectively.

You should really use a large number of bootstraps (eg 500), but be warned that it takes a long time to run this many.
Let’s fit the null first:
```python
sco.fit_null(n_boot=10)
```
And now we can plot it to see what the likely range of the coefficient would be under the null:
```python
sco.plot(show_null_curve=True)
```
Finally, there is a summary of some statistical tests that bear on whether the coefficient is consistent with the null hypothesis of zero, and on whether the shares of positive and negative estimates are what you would expect from chance.
```python
sco.null_stats_summary
```
| | estimate | p-value |
|---|---|---|
| median | 0.00593 | <0.001 |
| share positive | 16 of 16 | NA |
| share negative | 0 of 16 | NA |
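As a conceptual sketch of where a p-value like the one on the median comes from (this follows the general bootstrap-under-the-null logic of [@simonsohn2020specification], not the package's internal code): force the true effect to zero, re-estimate the curve many times, and ask how often a median at least as extreme as the observed one appears.

```python
import numpy as np

def null_p_value_for_median(observed_median: float, boot_medians: np.ndarray) -> float:
    """Share of bootstrap-under-the-null medians at least as extreme as observed."""
    return float(np.mean(np.abs(boot_medians) >= np.abs(observed_median)))

# Hypothetical bootstrap medians, for illustration only
boot_medians = np.array([0.001, -0.002, 0.0005, -0.001, 0.0015])
print(null_p_value_for_median(0.00593, boot_medians))
```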
And, once you have done all of this fitting, you can examine it all simply by calling the object:
```python
sco
```

```
--------------------------
Specification Curve object
--------------------------
y vars: y1, y2
x vars: x1
controls: c1, c2, c3
always included:
Specifications have run
-----------------------
Estimator: <class 'statsmodels.regression.linear_model.OLS'>
No. specifications: 16
Coefficient stats:
   min  median    max
 0.399     0.4  0.401
Coeffs under null have run
--------------------------
                estimate  p-value
median           0.00593   <0.001
share positive  16 of 16       NA
share negative   0 of 16       NA
```
Saving results to file
Save the plot to file (the format is inferred from the file extension):

```python
sco = sc.SpecificationCurve(df, y_endog, x_exog, controls, cat_expand=["group1"])
sco.fit()
sco.plot(save_path="test_fig.pdf")
```
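Any format that matplotlib can write should work in the same way; for example, a PNG (a usage sketch with a hypothetical filename):

```python
sco.plot(save_path="test_fig.png")  # hypothetical filename; the extension sets the format
```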
Expanding a categorical variable
Should you need to, you can expand a categorical variable into its different elements and run those separately. In the example below, the `"group2"` categorical variable is expanded like this.
```python
y_endog = "y1"  # endogenous variable
x_exog = "x1"  # exogenous variable
controls = ["c1", "c2", "group1", "group2"]
sco = sc.SpecificationCurve(
    df,
    y_endog,
    x_exog,
    controls,
    cat_expand=["group2"],  # have each fixed effect run separately
)
sco.fit()
sco.plot()
```
Using multiple exogenous variables
Sometimes, you’d like to check different independent variables (and the coefficients they come with following a regression). This is achieved by passing a list as the exogenous argument of `SpecificationCurve`. These variations on the independent variable are labelled by `x` in the plot.
```python
df = sc.load_example_data1()
x_exog = ["x1", "x2"]
y_endog = "y1"
controls = ["c1", "c2", "group1", "group2"]
sco = sc.SpecificationCurve(df, y_endog, x_exog, controls)
sco.fit()
sco.plot()
```
Excluding some combinations of controls
Some controls may be redundant, and you might want to prevent them from being used together. The `exclu_grps` keyword argument achieves this. In the example below, `"c1"` and `"c2"` are never used in the same specification.
```python
df = sc.load_example_data1()

y_endog = "y1"
x_exog = "x1"
controls = ["c1", "c2", "group1", "group2"]
sco = sc.SpecificationCurve(df, y_endog, x_exog, controls, exclu_grps=[["c1", "c2"]])
sco.fit()
sco.plot()
```
Always include some controls in all specifications
Likewise, there will be times when you always wish to include a particular control in every specification, and to show this on the plot. The `always_include=` keyword argument helps you achieve this. In the example below, we ask that `"c1"` is included in every specification.
```python
df = sc.load_example_data1()
x_exog = "x1"
y_endog = "y1"
controls = ["c2", "group1", "group2"]
sco = sc.SpecificationCurve(df, y_endog, x_exog, controls, always_include="c1")
sco.fit()
sco.plot()
```
Flexing the style for very large numbers of specifications
The default plot style isn’t suitable for very large numbers of specifications, so the package automatically switches to a style suited to many specifications when needed.

Here’s an example:
```python
# Generate some fake data
n_samples = 2000
# Number of continuous random variables
n_dim = 8
c_rnd_vars = prng.random(size=(n_dim, n_samples))
c_rnd_vars_names = [f"c_{i}" for i in range(np.shape(c_rnd_vars)[0])]
y_1 = (
    0.4 * c_rnd_vars[0, :]  # This is the true value of the coefficient
    - 0.2 * c_rnd_vars[1, :]
    + 0.3 * prng.standard_normal(n_samples)
)
# Next line causes y_2 ests to be much more noisy
y_2 = y_1 - 0.3 * np.abs(prng.standard_normal(n_samples))
df = pd.DataFrame([y_1, y_2], ["y1", "y2"]).T
for i, col_name in enumerate(c_rnd_vars_names):
    df[col_name] = c_rnd_vars[i, :]

controls = c_rnd_vars_names[1:]

# Run it with Specification Curve
sco = sc.SpecificationCurve(df, ["y1", "y2"], c_rnd_vars_names[0], controls)
sco.fit()
sco.plot()
```
Flagging a preferred specification
Often, in practice, you will have a preferred specification that you will use as your main estimate. You can specify this and have it flagged on the plot by passing a list of the variables you’d like to be used in your preferred specification via the `preferred_spec` keyword argument.
In the example below, the preferred specification comes out as being close to the known answer that we constructed.
```python
sco = sc.SpecificationCurve(df, ["y1", "y2"], c_rnd_vars_names[0], controls)
sco.fit()
sco.plot(preferred_spec=["y1", c_rnd_vars_names[0]] + controls)
```
Using models other than Ordinary Least Squares
The default model is OLS, but you can pass through other models too.
```python
# Specify the regressions to run
y_endog = ["y", "y2"]
x_exog = ["x_1", "x_5"]
controls = ["x_3", "x_2", "x_4"]
sco = sc.SpecificationCurve(df_logit, y_endog, x_exog, controls)
# Fit using the logit estimator
sco.fit(estimator=sm.Logit)  # sm.Probit also works
sco.plot()
```
```
Optimization terminated successfully.
         Current function value: 0.350088
         Iterations 7
Optimization terminated successfully.
         Current function value: 0.346934
         Iterations 7
... (one convergence message per specification; the rest are omitted)
```