Coming from Stata
This chapter has benefitted enormously from Daniel M. Sullivan’s excellent notes.
The biggest difference between Python and Stata is that Python is a fully-fledged, general-purpose programming language, while Stata is focused on data analysis. In practice, this means that the notation for a given operation in Python (or any other general-purpose programming language) is sometimes less concise than in Stata: because Python does many more things, there is greater competition for each short command name.
Another difference is that, in Stata, there is one dataset in memory, represented as a matrix in which each column is a “variable” with a unique name. In Python, variables can be anything, even functions! But most data analysis in Python is done using dataframes, which are objects that are somewhat similar to a single dataset in Stata. In Python, you can have as many dataframes as you like in use at once. This causes the first major notational difference: in Python, you need to specify which dataframe you want to perform an operation on, in addition to which column (or row, or entry).
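As a minimal sketch of what this means in practice, the snippet below keeps two dataframes in memory at once and names the dataframe in every operation. The dataframe names, column names, and values are invented purely for illustration.

```python
import pandas as pd

# Two illustrative dataframes held in memory at the same time
# (all names and values here are made up for this example)
households = pd.DataFrame(
    {"hh_id": [1, 2, 3], "income": [30_000, 45_000, 52_000]}
)
firms = pd.DataFrame({"firm_id": [10, 20], "employees": [12, 250]})

# Every operation names the dataframe as well as the column
mean_income = households["income"].mean()
total_employees = firms["employees"].sum()

# A Python variable can hold anything, including a function
stat = pd.Series.median
median_income = stat(households["income"])

print(mean_income, total_employees, median_income)
```

In Stata, the equivalent of `households["income"].mean()` would not need to say which dataset it refers to, because there is only one in memory.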
Finally, Python and its data analysis packages are free.
Although Python is not a programming language dedicated solely to data analysis, it has first-class support for data analysis via its pandas package. Support for regressions is perhaps less good than in Stata, and certainly a bit more verbose, but you can still do pretty much every standard operation you can think of.
Stata <==> Python
What follows is a giant table of translations between Stata code and Python, leaning heavily on Python’s pandas (panel-data-analysis) package. We’re going to rely on a few packages for econometrics below: statsmodels as a general-purpose and flexible regression library, pyfixest for when you need high-dimensional fixed effects, and binsreg for bin scatter plots.
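To give a flavour of the general-purpose library mentioned above, here is a minimal sketch of an ordinary least squares regression using the statsmodels formula interface. The data are simulated and the column names `x` and `y` are assumptions made for this example only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data for illustration: y = 1 + 2x + noise
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({"x": rng.normal(size=n)})
df["y"] = 1.0 + 2.0 * df["x"] + rng.normal(scale=0.5, size=n)

# Roughly the counterpart of Stata's: regress y x
results = smf.ols("y ~ x", data=df).fit()
print(results.summary())
```

Note that the dataframe is passed explicitly through `data=df`, and that fitting and displaying the results are separate steps.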
Many of the examples below assume that, in Python, you have a pandas DataFrame called `df`. We will use placeholders like `varname` for Stata variables and `df['varname']` for the Python equivalent. Remember that you need to `import pandas as pd` before running any of the examples that use `pd`. For the econometrics examples, you will need to import the relevant package.
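The following sketch shows the conventions just described: a dataframe called `df`, a column accessed as `df['varname']`, and pandas imported as `pd`. The extra column names and the values are illustrative assumptions, and the Stata commands in the comments are rough analogues rather than exact equivalents.

```python
import pandas as pd

# A small illustrative dataframe; "varname" stands in for any variable
df = pd.DataFrame({"varname": [1.0, 2.5, 4.0], "other": ["a", "b", "c"]})

# Select one column (this returns a pandas Series)
col = df["varname"]

# Create a new column; roughly: gen doubled = varname * 2
df["doubled"] = df["varname"] * 2

# Keep rows meeting a condition; roughly: keep if varname > 2
subset = df[df["varname"] > 2]

print(subset)
```

Unlike `keep if` in Stata, the filter here does not modify `df`; it creates a new dataframe, `subset`, that lives alongside the original.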
You can find more on (frequentist) regressions in Regression, generalised regression models appear in Generalised regression models, and regression diagnostics and visualisation are in Regression diagnostics and visualisations. Python is also very strong for Bayesian regressions: Bayesian regressions using formulae appear in Bayesian Inference Made Easier.
Stata | Python (pandas) |
---|---|
| pandas has several reshaping functions, including |
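As an illustration of the reshaping point in the table, the sketch below converts a small dataset from wide to long format and back using `pd.wide_to_long` and `DataFrame.pivot`. These are two of pandas' reshaping tools and are not necessarily the ones the table originally listed. The data and column names are invented, and the Stata commands in the comments are approximate analogues.

```python
import pandas as pd

# Illustrative wide data: one row per id, one income column per year
wide = pd.DataFrame(
    {"id": [1, 2], "inc2019": [10, 20], "inc2020": [11, 22]}
)

# Wide to long; similar in spirit to: reshape long inc, i(id) j(year)
long_df = pd.wide_to_long(wide, stubnames="inc", i="id", j="year").reset_index()

# Long back to wide; similar in spirit to: reshape wide inc, i(id) j(year)
wide_again = long_df.pivot(index="id", columns="year", values="inc")

print(long_df)
print(wide_again)
```

`DataFrame.melt` and `DataFrame.stack` offer other routes to the long format, depending on how the column names are structured.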
The table below presents further examples of doing regression with both the statsmodels and pyfixest packages.
Note that, in the examples below, you need only import pyfixest once in each Python session (conventionally `import pyfixest as pf`); the syntax for looking at results is `results = pf.feols(...)` and then `results.summary()`.
Command | Stata | Python |
---|---|---|
Fixed Effects (absorbing) | | |
Categorical regression | | |
Interacting categoricals | | |
Robust standard errors | | |
Clustered standard errors | | |
Two-way clustered standard errors | | |
Instrumental variables | | |
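For some of the rows in the table above, here is a hedged sketch of how they might be written with the statsmodels formula interface. This is one possible way of doing it, not necessarily the code the table originally contained. The data are simulated and the column names (`x`, `y`, `region`, `sector`, `firm`) are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data (illustrative only)
rng = np.random.default_rng(2)
n = 600
df = pd.DataFrame(
    {
        "x": rng.normal(size=n),
        "region": rng.choice(["north", "south", "west"], size=n),
        "sector": rng.choice(["a", "b"], size=n),
        "firm": rng.integers(0, 40, size=n),
    }
)
df["y"] = (
    1.0
    + 0.5 * df["x"]
    + 1.5 * (df["region"] == "south")
    + rng.normal(size=n)
)

# Categorical regression: C() expands a column into dummy variables
cat_fit = smf.ols("y ~ x + C(region)", data=df).fit()

# Interacting categoricals: a * b gives main effects plus interactions
inter_fit = smf.ols("y ~ x + C(region) * C(sector)", data=df).fit()

# Heteroskedasticity-robust (HC1) standard errors
robust_fit = smf.ols("y ~ x + C(region)", data=df).fit(cov_type="HC1")

# Standard errors clustered by firm
cluster_fit = smf.ols("y ~ x + C(region)", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["firm"]}
)

print(cluster_fit.summary())
```

The point estimates are identical across `cat_fit`, `robust_fit`, and `cluster_fit`; only the standard errors change with the `cov_type` argument.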