Automating Research Outputs

In this chapter, you’ll learn how to automate the inclusion of figures and tables in LaTeX-derived research outputs, including PDFs and slides, plus how to convert those outputs to Microsoft Word documents and more. Much of what you’ll see in this chapter applies to a wide range of coding languages.

This chapter has some similarities with another chapter, on Combining Code and Text in Quarto Markdown. Increasingly, we’re inclined to use Quarto to write papers where we want research outputs to be automated because it supports a few things that are hard or infeasible in LaTeX. These include within-document code execution and in-line insertion of variables. And the Quarto-based approach still generates a LaTeX file if you want! If you’re not sure where to start, we now recommend you begin with Quarto rather than pure LaTeX. Check the other chapter for more—particularly the last section.

In this chapter, we focus on “traditional” academic paper writing using the LaTeX typesetting language. It has a bit of a learning curve. If you’re just looking to create some automated reports using code and text, rather than write pre-prints, working papers, journal articles, or academic-talk style slide decks, the chapter on Combining Code and Text in Quarto Markdown is going to be a better and easier fit for you.

Automating the inclusion of figures and tables in your research outputs has many benefits:

once configured, it’s clearly easier than manual updates
your paper can update at the touch of a button
it helps with creating a reproducible analytical pipeline (for more on these, see the Reproducible Analysis chapter)
it enforces structure on your project
automation is complementary to other good practices such as version control

Let’s now turn to the how.

Including research outputs in LaTeX documents and slides

Let’s say you’re writing a paper, using \(\LaTeX\), or a presentation, using \(\LaTeX\) and beamer. (Perhaps you’d like the final document or presentation to be in Word or PowerPoint; that’s okay too, and we’ll come to it shortly, but let’s assume you’re writing it in \(\LaTeX\).)

Including code outputs is pretty simple, but is slightly different for figures and tables (the two main outputs you might include).

Figures

For figures, the \(\LaTeX\) graphicx package is your friend as it allows you to set a directory where your figures live, for example outputs/figures, which would be set like this at the top of the document:

\usepackage{graphicx}
\graphicspath{{outputs/figures/}}

We’re imagining here that we have a project structure like this:

code.py
paper.tex
outputs/
    figures/
        chart.pdf
    tables/
        reg_table.tex

Then, whenever you need to include a figure, say chart.pdf, you can always do it using

\begin{figure}
	\includegraphics[width=\textwidth]{chart.pdf}
	\caption{Example figure. \label{fig:example}}
\end{figure}

Let’s pretend chart.pdf is generated by the most popular Python graphics library, matplotlib. The code in ‘code.py’ which puts the chart in the ‘figures’ folder could look something like this:

import matplotlib.pyplot as plt

# Draw a simple scatter plot
fig, ax = plt.subplots()
ax.scatter(range(5), range(5), s=50, c='b')
# Save the chart where the LaTeX document expects to find it
plt.savefig("outputs/figures/chart.pdf")

The important line here is plt.savefig("outputs/figures/chart.pdf") because it says to save the figure in the ‘figures’ directory. When you re-run your code, the chart ends up in the right place. When you re-compile your \(\LaTeX\) document or presentation, it can pick the chart up from the right place.
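
One practical wrinkle is that savefig raises an error if the target folder does not exist yet, for instance on a co-author’s fresh copy of the project. The sketch below shows one way to guard against that. The helper name figure_path is hypothetical (it is not part of matplotlib), and the demo runs in a temporary folder purely so that it leaves no files behind.

```python
import tempfile
from pathlib import Path


def figure_path(filename, root="outputs/figures"):
    """Return root/filename, creating the folder first if it is missing.

    Hypothetical helper for illustration; not part of matplotlib.
    """
    folder = Path(root)
    folder.mkdir(parents=True, exist_ok=True)
    return folder / filename


# Demo in a throwaway folder so that running this sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    target = figure_path("chart.pdf", root=Path(tmp) / "outputs" / "figures")
    folder_exists = target.parent.is_dir()

print(target.name, folder_exists)
```

In code.py, the save line could then read plt.savefig(figure_path("chart.pdf")), since savefig accepts path objects as well as strings.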

Tables

Now let’s imagine you’ve created a table of descriptive statistics such as the one below:

import seaborn as sns
import pandas as pd

tips = sns.load_dataset("tips")

table = tips.groupby(["smoker", "time"], observed=True)["tip"].mean().unstack().round(2)
table

time	Lunch	Dinner
smoker
Yes	2.83	3.07
No	2.67	3.13

This can be turned into a \(\LaTeX\) table using the following command

table.style.to_latex(caption='A Table', label='tab:descriptive')

'\\begin{table}\n\\caption{A Table}\n\\label{tab:descriptive}\n\\begin{tabular}{lrr}\ntime & Lunch & Dinner \\\\\nsmoker &  &  \\\\\nYes & 2.830000 & 3.070000 \\\\\nNo & 2.670000 & 3.130000 \\\\\n\\end{tabular}\n\\end{table}\n'

Or perhaps you have a regression table, for example

import pandas as pd
from sklearn import datasets
import statsmodels.formula.api as smf
from stargazer.stargazer import Stargazer

diabetes = datasets.load_diabetes()
df = pd.DataFrame(diabetes.data)
df.columns = ['Age', 'Sex', 'BMI', 'ABP', 'S1', 'S2', 'S3', 'S4', 'S5', 'S6']
df['target'] = diabetes.target

est = smf.ols('target ~ Age + Sex + BMI + ABP', data=df).fit()
est2 = smf.ols('target ~ Age + Sex + BMI + ABP + S1 + S2', data=df).fit()

reg_results = Stargazer([est, est2])
reg_results


	Dependent variable: target

	(1)	(2)

ABP	416.673^***	397.581^***
	(69.495)	(70.870)
Age	37.241	24.703
	(64.117)	(65.411)
BMI	787.182^***	789.744^***
	(65.424)	(66.887)
Intercept	152.133^***	152.133^***
	(2.853)	(2.853)
S1		197.848
		(143.812)
S2		-169.243
		(142.744)
Sex	-106.576^*	-82.862
	(62.125)	(64.851)

Observations	442	442
R²	0.400	0.403
Adjusted R²	0.395	0.395
Residual Std. Error	59.976 (df=437)	59.982 (df=435)
F Statistic	72.913^*** (df=4; 437)	48.915^*** (df=6; 435)

Note:	^*p<0.1; ^**p<0.05; ^***p<0.01

Or you could estimate the regressions with the pyfixest package and display them using its etable function:

import numpy as np
import pandas as pd
#import pylatex as pl  # for the latex table; note: not a dependency of pyfixest - needs manual installation
from great_tables import loc, style
from IPython.display import FileLink, display

import pyfixest as pf

data = pf.get_data()

fit1 = pf.feols("Y ~ X1 + X2 | f1", data=data)
fit2 = pf.feols("Y ~ X1 + X2 | f1 + f2", data=data)
fit3 = pf.feols("Y2 ~ X1 + X2 | f1", data=data)
fit4 = pf.feols("Y2 ~ X1 + X2 | f1 + f2", data=data)

pf.etable([fit1, fit2, fit3, fit4,])

	Y		Y2
	(1)	(2)	(3)	(4)
coef
X1	-0.950*** (0.067)	-0.924*** (0.061)	-1.267*** (0.174)	-1.232*** (0.192)
X2	-0.174*** (0.018)	-0.174*** (0.015)	-0.131** (0.042)	-0.118** (0.042)
fe
f1	x	x	x	x
f2	-	x	-	x
stats
Observations	997	997	998	998
S.E. type	by: f1	by: f1	by: f1	by: f1
R²	0.489	0.659	0.120	0.172
Significance levels: * p < 0.05, ** p < 0.01, *** p < 0.001. Format of coefficient cell: Coefficient (Std. Error)

This table can be cast into \(\LaTeX\) by passing type="tex" to etable.

tab = pf.etable(
    [fit1, fit2, fit3, fit4],
    digits=2,
    type="tex",
    print_tex=True,
)

tab

'\\renewcommand\\cellalign{t}\n\\begin{threeparttable}\n\\begin{tabular}{lccccc}\n\\toprule\n & \\multicolumn{2}{c}{Y} & \\multicolumn{2}{c}{Y2} \\\\\n\\cmidrule(lr){2-3} \\cmidrule(lr){4-5} \n & (1) & (2) & (3) & (4) \\\\\n\\midrule\n\\addlinespace\nX1 & \\makecell{-0.95*** \\\\ (0.07)} & \\makecell{-0.92*** \\\\ (0.06)} & \\makecell{-1.27*** \\\\ (0.17)} & \\makecell{-1.23*** \\\\ (0.19)} \\\\\nX2 & \\makecell{-0.17*** \\\\ (0.02)} & \\makecell{-0.17*** \\\\ (0.01)} & \\makecell{-0.13** \\\\ (0.04)} & \\makecell{-0.12** \\\\ (0.04)} \\\\\n\\midrule\n\\addlinespace\nf1 & x & x & x & x \\\\\nf2 & - & x & - & x \\\\\n\\midrule\n\\addlinespace\nObservations & 997 & 997 & 998 & 998 \\\\\nS.E. type & by: f1 & by: f1 & by: f1 & by: f1 \\\\\n$R^2$ & 0.49 & 0.66 & 0.12 & 0.17 \\\\\n\\bottomrule\n\\end{tabular}\n\\footnotesize Significance levels: $*$ p $<$ 0.05, $**$ p $<$ 0.01, $***$ p $<$ 0.001. Format of coefficient cell: Coefficient \n (Std. Error)\n\\end{threeparttable}'

We’d like to export tables like this into files that can be picked up by our \(\LaTeX\) document. We must first save it to the right place from Python. Assuming you have the folders “outputs/tables” relative to your working directory, this would be

from pathlib import Path
with open(Path('outputs/tables/reg_table.tex'), 'w') as f:
    f.write(table.style.to_latex(caption='A Table', label='tab:descriptive'))

in the first example, and

from pathlib import Path
with open(Path('outputs/tables/reg_table.tex'), 'w') as f:
    f.write(tab)

in the second. Remember that Path, from Python’s built-in pathlib module, builds the correct file path regardless of which operating system you happen to be using at the time. This is especially useful when you have co-authors on different systems!

The code chunk above opens up a file in write mode in the right directory, relative to the working directory from which code.py is run, and writes the \(\LaTeX\) code for the table into it. We now need to ensure that this \(\LaTeX\) gets picked up in our paper. Inside the paper, you need a line:

\input{outputs/tables/reg_table.tex}

which picks up your table. If you don’t want to have to add the full path to the tables directory each time, you can add this near the top of ‘paper.tex’:

\makeatletter
\providecommand*{\input@path}{}
\g@addto@macro\input@path{{outputs/tables/}}
\makeatother

With this in place, you need only write \input{reg_table.tex} in your \(\LaTeX\) document.
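
Because the paper now depends on files generated by code, it can be handy to check that every file named in an \input{...} or \includegraphics{...} command exists before you compile. The following is a rough sketch of such a check, not a full LaTeX parser: the function name missing_outputs is hypothetical, it assumes file names are written with their extensions (as in this chapter), and it does not skip commented-out lines. The demo builds a throwaway project in a temporary folder.

```python
import re
import tempfile
from pathlib import Path

# Matches \input{file} and \includegraphics[options]{file}, capturing the file name.
REFERENCE = re.compile(r"\\(?:input|includegraphics)(?:\[[^\]]*\])?\{([^}]+)\}")


def missing_outputs(tex_source, search_dirs):
    """List referenced files that are not present in any of search_dirs."""
    return [
        name
        for name in REFERENCE.findall(tex_source)
        if not any((Path(folder) / name).is_file() for folder in search_dirs)
    ]


paper = r"""
\includegraphics[width=\textwidth]{chart.pdf}
\input{reg_table.tex}
"""

# Demo project in a temporary folder: the table exists, the chart does not yet.
with tempfile.TemporaryDirectory() as tmp:
    tables = Path(tmp) / "outputs" / "tables"
    figures = Path(tmp) / "outputs" / "figures"
    tables.mkdir(parents=True)
    figures.mkdir(parents=True)
    (tables / "reg_table.tex").write_text("placeholder table")
    missing = missing_outputs(paper, [tmp, figures, tables])

print(missing)
```

The list of search folders mirrors what \graphicspath and \input@path tell \(\LaTeX\): the project root plus the figures and tables directories.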

Exporting papers and slides to other document types

When including your research outputs automatically, you may not want your final output to be a PDF (the standard output for \(\LaTeX\)), but to be one of a range of other document types. That’s perfectly possible, and you can choose from a really wide range of output types, although input types will be limited to formats that can use file paths such as \(\LaTeX\) and markdown.

To perform the magic conversion to other document types (and often between types), we’ll use the command line tool pandoc, which is absolutely brilliant. It can translate \(\LaTeX\) papers and beamer presentations into a whole variety of other formats, including Microsoft Word’s .docx, OpenOffice’s .odt, Microsoft PowerPoint’s .pptx, HTML, plain text, markdown, and more. It can also convert from most of those formats (and more), in one direction only, to PDF, Microsoft PowerPoint, and \(\LaTeX\) Beamer.

To use pandoc, first install it following the instructions on the website.

Converting Documents

To convert documents, the general syntax for pandoc looks like this:

pandoc mydoc.tex -o mydoc.docx

This is an example where the input is a .tex document and the output, -o, is a Microsoft Word docx file.
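
If the rest of your pipeline is in Python, you may prefer to trigger the conversion from a script rather than typing it each time. The following is a minimal sketch using only the standard library. The helper name pandoc_command is hypothetical, and the file names are the ones from the example above.

```python
import shutil
import subprocess
from pathlib import Path


def pandoc_command(source, target, extra_args=()):
    """Build the argument list for `pandoc source -o target`."""
    return ["pandoc", str(source), "-o", str(target), *extra_args]


command = pandoc_command("mydoc.tex", "mydoc.docx")
print(" ".join(command))

# Only attempt the conversion when pandoc is installed and the input exists.
if shutil.which("pandoc") and Path("mydoc.tex").is_file():
    subprocess.run(command, check=True)
```

Passing the command as a list, rather than as one string, avoids problems with spaces in file names.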

You can try this yourself using the following minimal tex file:

\documentclass{article}
\usepackage[margin=0.7in]{geometry}
\usepackage[parfill]{parskip}
\usepackage[utf8]{inputenc}
\usepackage{amsmath,amssymb,amsfonts,amsthm}

\begin{document}

This is some text

And an equation:
\[
    u'(c_{t})=\beta(1+r_{t+1})u'(c_{t+1})
\]

\section{Section Heading}

More text

\end{document}

Exercise

Create a .tex file from the tex code above and convert it to a Word document using pandoc.

What’s surprising is how effective the conversion to Word is, even if you have figures, equations, and other non-text features.

You can get quite fancy with pandoc: for example, you can translate a whole book’s worth of \(\LaTeX\) into a Word doc complete with a Word style, a bibliography via biblatex, equations, and figures. Nothing can save Word from being painful to use, but pandoc certainly helps. If you want to see a couple of examples, you could check out cookie-cutter-latex-book-manuscript.

Converting Slides

Beamer slides can be converted in much the same way that documents can. Popular output formats for slides include PDF, HTML (via dzslides, slidy, or revealjs), and .pptx (PowerPoint).

For example, to create revealjs slides,

pandoc -f latex -t revealjs -s --self-contained -o presentation.html presentation.tex --mathjax

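where presentation.tex is the input file. (Self-contained just creates a single, large output HTML file; mathjax enables equations in the HTML. Recent versions of pandoc recommend --embed-resources --standalone instead of the deprecated --self-contained.) For PowerPoint, the equivalent is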

pandoc -f latex -t pptx -o presentation.pptx presentation.tex

As with Word documents, you can use a reference PowerPoint file for style, passed via pandoc’s --reference-doc option. Here is a minimal example of the tex code for a beamer presentation:

\documentclass[aspectratio=169]{beamer}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\mode<presentation>
{
  \usetheme{default}
  \usecolortheme{default}
  \usefonttheme{default}
  \setbeamertemplate{caption}[numbered]
}
\usepackage{graphicx}
\usepackage{booktabs}
\usepackage{hyperref}

\title{Title for a minimal beamer presentation}
\author{Author One}
\institute{Name of institution}
\date{\today}

\begin{document}
\begin{frame}
  \titlepage
\end{frame}

\section{Section One}

\begin{frame}{Slide with bullet points}
    This is a bullet list of two points:
    \begin{itemize}
        \item Point one
        \item Point two
    \end{itemize}
\end{frame}
\section{Section Two}
\begin{frame}
Slide with an equation
\[
    u'(c_{t})=\beta(1+r_{t+1})u'(c_{t+1})
\]
\end{frame}

\end{document}