Let’s see some of Specification Curve’s features in action.
Basic Use
Here’s an example of using Specification Curve. Note that, in the below, we can pass strings or lists of strings into the arguments of the SpecificationCurve class. The package then automatically performs all possible regressions of the endogenous variables on the exogenous variables and controls. The estimate that is picked out is the coefficient on the given combination of endogenous and exogenous variables, conditioning on the given controls.
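Here’s a minimal, self-contained sketch of basic use. The fake data below (and column names such as "x1", "c1", and "group2") are illustrative assumptions chosen to match the examples later in this section; assume the package has been imported as specy.

import numpy as np
import pandas as pd
import specification_curve as specy

# Illustrative fake data: one outcome, one regressor, two continuous
# controls, and two categorical controls (an assumption for this sketch)
prng = np.random.default_rng(42)
n = 500
df = pd.DataFrame(
    {
        "x1": prng.random(n),
        "c1": prng.random(n),
        "c2": prng.random(n),
        "group1": prng.integers(2, size=n),
        "group2": prng.integers(3, size=n),
    }
)
df["group1"] = df["group1"].astype("category")  # treated as a fixed effect
df["group2"] = df["group2"].astype("category")  # treated as a fixed effect
df["y1"] = 0.5 * df["x1"] - 0.2 * df["c1"] + 0.2 * prng.standard_normal(n)

# Strings or lists of strings both work for each argument
sc = specy.SpecificationCurve(df, "y1", "x1", ["c1", "c2", "group1", "group2"])
sc.fit()   # runs every possible combination of the controls
sc.plot()  # draws the specification curve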
If a control variable is categorical, rather than continuous, it will be treated as a fixed effect.
Grey squares (black lines when there are many specifications) show whether a variable is included in a specification or not. Blue markers and error bars show coefficients that are positive and significant (at the 0.05 level); red ones show coefficients that are negative and significant.
Retrieving estimates
You can retrieve the estimates from the results data frame:
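For example, assuming (as in the package’s documentation) that the fitted results live in the df_r attribute; treat the attribute name as an assumption for your installed version:

# One row per specification, with the coefficient estimates and which
# controls were included (df_r is assumed to be the results attribute)
print(sc.df_r.head())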
Should you need to, you can expand a categorical variable into its distinct values and run each of those separately. In the example below, the "group2" categorical variable is expanded in this way.
y_endog ="y1"# endogeneous variablex_exog ="x1"# exogeneous variablecontrols = ["c1", "c2", "group1", "group2"]sc = specy.SpecificationCurve( df, y_endog, x_exog, controls, cat_expand=["group2"], # have each fixed effect run separately)sc.fit()sc.plot()
Fit complete
Using multiple exogenous variables
Sometimes, you’d like to check different independent variables (and the coefficients they come with following a regression). This is achieved by passing a list as the exogenous argument of SpecificationCurve. These variations on the independent variable are labelled by x in the plot, as in the sketch below.
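Here’s a hedged sketch reusing the illustrative df from above; the second regressor "x2" is hypothetical and created just for this example:

# Create a hypothetical second regressor correlated with x1
df["x2"] = df["x1"] + 0.1 * prng.standard_normal(len(df))

# Pass a list of exogenous variables; each appears as an x variant in the plot
sc = specy.SpecificationCurve(df, "y1", ["x1", "x2"], ["c1", "c2"])
sc.fit()
sc.plot()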
Always include some controls in all specifications
Likewise, there will be times when you always wish to include a particular control in every specification, and to show this on the plot. The always_include= keyword argument helps you achieve this.
In the example below, we ask that "c1" be included in every specification.
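A sketch with the same illustrative data as before (whether always_include= accepts a bare string as well as a list may depend on your version, so check the documentation if this errors):

sc = specy.SpecificationCurve(
    df,
    "y1",
    "x1",
    ["c2", "group1", "group2"],
    always_include="c1",  # "c1" enters every specification
)
sc.fit()
sc.plot()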
Flexing the style for very large numbers of specifications
The default plot style isn’t suited to very large numbers of specifications, so the plot automatically switches to a style designed for many specifications when needed.
Here’s an example:
import numpy as np
import pandas as pd

# Set seed for random numbers
seed_for_prng = 78557
# prng = pseudo-random number generator
prng = np.random.default_rng(seed_for_prng)
# Generate some fake data
n_samples = 400
# Number of continuous random variables (dimensions)
n_dim = 8
c_rnd_vars = prng.random(size=(n_dim, n_samples))
c_rnd_vars_names = [f"c_{i}" for i in range(np.shape(c_rnd_vars)[0])]
y_1 = (
    0.4 * c_rnd_vars[0, :]  # This is the true value of the coefficient
    - 0.2 * c_rnd_vars[1, :]
    + 0.3 * prng.standard_normal(n_samples)
)
# The next line makes the y_2 estimates much noisier
y_2 = y_1 - 0.3 * np.abs(prng.standard_normal(n_samples))
df = pd.DataFrame([y_1, y_2], ["y1", "y2"]).T
for i, col_name in enumerate(c_rnd_vars_names):
    df[col_name] = c_rnd_vars[i, :]
controls = c_rnd_vars_names[1:]
# Run it with Specification Curve
sc = specy.SpecificationCurve(df, ["y1", "y2"], c_rnd_vars_names[0], controls)
sc.fit()
sc.plot()
Fit complete
Flagging a preferred specification
Often, in practice, you will have a preferred specification that you will use as your headline estimate. You can specify this and have it flagged on the plot.
You achieve this by passing a list of the variables you’d like to appear in your preferred specification via the preferred_spec keyword argument.
In the example below, the preferred specification comes out close to the known answer that we constructed.
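Here’s a sketch continuing from the fake data generated above. The package’s documentation passes preferred_spec to plot(); the particular variable list below (outcome, the true regressor, and one control) is an illustrative assumption:

sc = specy.SpecificationCurve(df, ["y1", "y2"], c_rnd_vars_names[0], controls)
sc.fit()
# Flag the specification built from y1, the true regressor c_0, and c_1
sc.plot(preferred_spec=["y1", "c_0", "c_1"])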
The default estimator is OLS, but you can pass other models too, via the estimator= keyword argument of fit().
import statsmodels.api as sm

# Generate some fake data
n_samples = 1000
x_2 = prng.integers(2, size=n_samples)
x_1 = prng.random(size=n_samples)
x_3 = prng.integers(3, size=n_samples)
x_4 = prng.random(size=n_samples)
x_5 = x_1 + 0.05 * prng.standard_normal(n_samples)  # use the seeded prng
x_beta = -1 - 3.5 * x_1 + 0.2 * x_2 + 0.3 * x_3  # NB: coefficient is -3.5
prob = 1 / (1 + np.exp(-x_beta))
y = prng.binomial(n=1, p=prob, size=n_samples)
y2 = prng.binomial(n=1, p=prob * 0.98, size=n_samples)
df = pd.DataFrame(
    [x_1, x_2, x_3, x_4, x_5, y, y2],
    ["x_1", "x_2", "x_3", "x_4", "x_5", "y", "y2"],
).T
# Specify the regressions to run
y_endog = ["y", "y2"]
x_exog = ["x_1", "x_5"]
controls = ["x_3", "x_2", "x_4"]
sc = specy.SpecificationCurve(df, y_endog, x_exog, controls)
# Fit using the logit estimator
sc.fit(estimator=sm.Logit)  # sm.Probit also works
sc.plot()
Optimization terminated successfully.
Current function value: 0.325863
Iterations 7
[…30 similar convergence messages omitted, one per specification…]
Optimization terminated successfully.
Current function value: 0.342536
Iterations 7
Fit complete