Tables#

Introduction#

The humble table is an underloved and underappreciated element of communicating analysis. While it may not be as visually engaging as a vivid graph (and is far less good for a general audience), it has the advantage of being able to convey exact numerical information. It’s also an essential part of some analysis: for example, when writing economics papers, there is usually a “table 1” that contains descriptive statistics. (For more on best practice for tables, check out the advice provided by the UK government’s Analysis Function.)

Options#

There are three main options for tables in Python.

pandas, which has some solid table export options. This is especially convenient if your data are already in a dataframe.
The gt (great tables) package by Posit, who also made an R version of the same package. This makes really good HTML and LaTeX tables: see the package website for more.
Finally, if you’re happy with image output formats, matplotlib, the infinitely flexible image package, is great. However, your tables have to be exported as image files rather than as machine-readable text-based files. If you want a graphic table that is more striking, matplotlib is brilliant. I wouldn’t say it’s the simplest way to create tables, as it relies on content featured in Powerful Data Visualisation with Matplotlib and Common Plots I, plus more besides. Rather than cover it here, we’ll direct you to this excellent blog post on ‘the grammar of tables’ by Karina Bartolomé that walks through creating a table with matplotlib.

Note that some packages, like pyfixest, come with their own options to export tables. In the case of pyfixest, there’s a built-in way to export regression tables to formats including LaTeX. You can find more on this in the page on regressions.

pandas for tables#

Imports and setup#

As ever, we’ll start by importing some key packages and initialising any settings:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(
    seed_for_prng
)  # prng = pseudo-random number generator

We’ll use the penguins dataset to demonstrate the use of pandas in creating tables. These data come with the seaborn package, which you’ll need:

import seaborn as sns

pen = sns.load_dataset("penguins")
pen.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
3	Adelie	Torgersen	NaN	NaN	NaN	NaN	NaN
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female

Preparing and creating your table in pandas#

There are a few operations that you’ll want to do again, and again, and again to create tables. A cross-tab is one such operation! A cross-tab is just a count of the number of elements split by two groupings. Rather than just display the raw counts from the pd.crosstab() function, we can add totals or percentages using the margins= and normalize= keyword arguments.

In the example below, we’ll use margins and normalisation so that each row sums to 1, and the last row shows the probability mass over islands.

pd.crosstab(pen["species"], pen["island"], margins=True, normalize="index")

island	Biscoe	Dream	Torgersen
species
Adelie	0.289474	0.368421	0.342105
Chinstrap	0.000000	1.000000	0.000000
Gentoo	1.000000	0.000000	0.000000
All	0.488372	0.360465	0.151163

The neat thing about the cross-tabs that come out is that they are themselves pandas dataframes.
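As a minimal sketch of how margins= and normalize= interact, the example below builds both a count table and a row-normalised table from a small made-up dataframe, then treats the result like any other dataframe. The toy dataframe and its values are assumptions for illustration only; they are not the penguins data.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins dataset
toy = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
        "island": ["Biscoe", "Dream", "Dream", "Biscoe", "Biscoe", "Dream"],
    }
)

# Raw counts, with an "All" row and an "All" column holding the totals
counts = pd.crosstab(toy["species"], toy["island"], margins=True)

# Row-wise shares: every row sums to 1, and margins=True appends an "All" row
shares = pd.crosstab(toy["species"], toy["island"], margins=True, normalize="index")

# The result is an ordinary dataframe, so the usual methods apply
shares_pct = (shares * 100).round(1)

print(counts)
print(shares_pct)
```

Keeping the counts and the shares as two separate dataframes makes it easy to report both, or to check one against the other.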

Of course, the usual pandas functions can be used to create any table you need:

pen_summary = pen.groupby(["species", "island"]).agg(
    median_bill=("bill_length_mm", "median"),
    mean_bill=("bill_length_mm", "mean"),
    std_flipper=("flipper_length_mm", "std"),
)
pen_summary

		median_bill	mean_bill	std_flipper
species	island
Adelie	Biscoe	38.70	38.975000	6.729247
	Dream	38.55	38.501786	6.585083
	Torgersen	38.90	38.950980	6.232238
Chinstrap	Dream	49.55	48.833824	7.131894
Gentoo	Biscoe	47.30	47.504878	6.484976

For reasons that will become apparent later, we’ll replace one of these values with a missing value (pd.NA).

pen_summary.iloc[2, 1] = pd.NA

The table we just saw, pen_summary, is not what you’d call publication quality. The numbers have lots of superfluous digits. The names are useful for when you’re doing analysis, but might not be so obvious to someone coming to this table for the first time. So let’s see what we can do to clean it up a bit.

First, those numbers. We can apply number rounding quickly using .round().

pen_summary.round(2)

		median_bill	mean_bill	std_flipper
species	island
Adelie	Biscoe	38.70	38.98	6.73
	Dream	38.55	38.50	6.59
	Torgersen	38.90	NaN	6.23
Chinstrap	Dream	49.55	48.83	7.13
Gentoo	Biscoe	47.30	47.50	6.48

This returns another dataframe. To change the names of the columns, you can just use one of the standard approaches:

pen_sum_clean = pen_summary.rename(
    columns={
        "median_bill": "Median bill length (mm)",
        "mean_bill": "Mean bill length (mm)",
        "std_flipper": "Std. deviation of flipper length",
    }
)
pen_sum_clean

		Median bill length (mm)	Mean bill length (mm)	Std. deviation of flipper length
species	island
Adelie	Biscoe	38.70	38.975000	6.729247
	Dream	38.55	38.501786	6.585083
	Torgersen	38.90	NaN	6.232238
Chinstrap	Dream	49.55	48.833824	7.131894
Gentoo	Biscoe	47.30	47.504878	6.484976

One tip is to always have a dictionary up your sleeve that maps between the short names that are convenient for coding, and the longer names that you need to make outputs clear. Then, just before you do any exporting, you can always map the short names into the long names.
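One way to put this tip into practice is sketched below. The dictionary name LONG_NAMES, the helper for_export(), and the numbers are all hypothetical; they are there only to make the example self-contained.

```python
import pandas as pd

# Short, code-friendly names mapped to reader-friendly labels (illustrative)
LONG_NAMES = {
    "median_bill": "Median bill length (mm)",
    "mean_bill": "Mean bill length (mm)",
    "std_flipper": "Std. deviation of flipper length",
}

# Made-up summary values, used only so that the example runs on its own
summary = pd.DataFrame(
    {
        "median_bill": [38.7, 49.55],
        "mean_bill": [38.9, 48.8],
        "std_flipper": [6.7, 7.1],
    },
    index=pd.Index(["Adelie", "Chinstrap"], name="species"),
)


def for_export(df: pd.DataFrame, decimals: int = 2) -> pd.DataFrame:
    """Round and relabel a copy of the table just before exporting it."""
    return df.round(decimals).rename(columns=LONG_NAMES)


table = for_export(summary)
print(table.to_string())
```

Because rename() returns a new dataframe, the short names stay available for any further analysis on the original object.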

Styling a pandas dataframe#

As well as making direct modifications to a dataframe, you can apply styles. These are a much more versatile way to style a table for some output formats, namely HTML and, although it isn’t perfect, LaTeX too. (But this doesn’t work for markdown outputs, and markdown doesn’t support such rich formatting in any case.)

Behind the scenes, when a table is displayed on a webpage like the one you’re reading right now, HTML (the language most of the internet is in) is used. Styling is a way of modifying the default HTML for showing tables so that they look nicer or better.

In the example below, you can see some of the options that are available:

precision is like .round()
na_rep sets how missing values are rendered
thousands sets the separator between every thousand (for readability)
formatter gives fine-grained control over the formatting of individual columns

pen_styled = pen_sum_clean.style.format(
    precision=3,
    na_rep="Value missing",
    thousands=",",
    formatter={
        "Mean bill length (mm)": "{:.1f}",
        # keys must match the column labels exactly; other keys have no effect
        "Std. deviation of flipper length": lambda x: "{:,.0f} um".format(x * 1e3),
    },
).set_caption("This is the title")
pen_styled

This is the title
		Median bill length (mm)	Mean bill length (mm)	Std. deviation of flipper length
species	island
Adelie	Biscoe	38.700	39.0	6,729 um
	Dream	38.550	38.5	6,585 um
	Torgersen	38.900	Value missing	6,232 um
Chinstrap	Dream	49.550	48.8	7,132 um
Gentoo	Biscoe	47.300	47.5	6,485 um

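If you need to add more labels to either the index or the column names, you can. It’s a bit fiddly, but you can. Note that once the columns have two levels, each column label is a tuple, so the string keys passed to formatter below no longer match any column and only the table-wide options (such as precision) take effect in the output.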

pen_sum_extra_col_info = pen_sum_clean.copy()  # create an independent copy
pen_sum_extra_col_info.columns = [["Lengths", "Lengths", "Stds"], pen_sum_clean.columns]

pen_sum_extra_col_info_styled = (
    pen_sum_extra_col_info.style.format(
        precision=3,
        na_rep="Value missing",
        thousands=",",
        formatter={
            "Mean bill length (mm)": "{:.1f}",
            "Std. deviation of flipper length (mm)": lambda x: "{:,.0f} um".format(
                x * 1e3
            ),
        },
    )
    .set_caption("This is the title")
    .set_table_styles([{"selector": "th", "props": [("text-align", "center")]}])
)
pen_sum_extra_col_info_styled

This is the title
		Lengths		Stds
		Median bill length (mm)	Mean bill length (mm)	Std. deviation of flipper length
species	island
Adelie	Biscoe	38.700	38.975	6.729
	Dream	38.550	38.502	6.585
	Torgersen	38.900	Value missing	6.232
Chinstrap	Dream	49.550	48.834	7.132
Gentoo	Biscoe	47.300	47.505	6.485

Let’s see an example of exporting this to LaTeX too:

pen_sum_extra_col_info_styled.to_latex()

'\\begin{table}\n\\caption{This is the title}\n\\thcenter\n\\begin{tabular}{llrrr}\n &  & \\multicolumn{2}{r}{Lengths} & Stds \\\\\n &  & Median bill length (mm) & Mean bill length (mm) & Std. deviation of flipper length \\\\\nspecies & island &  &  &  \\\\\n\\multirow[c]{3}{*}{Adelie} & Biscoe & 38.700 & 38.975 & 6.729 \\\\\n & Dream & 38.550 & 38.502 & 6.585 \\\\\n & Torgersen & 38.900 & Value missing & 6.232 \\\\\nChinstrap & Dream & 49.550 & 48.834 & 7.132 \\\\\nGentoo & Biscoe & 47.300 & 47.505 & 6.485 \\\\\n\\end{tabular}\n\\end{table}\n'

You can read more about the style functionality over at the pandas Styler docs.

Another thing you might reasonably want to do is provide summary statistics, either across columns (shown as a final column) or across rows (shown as a final row). To do this for, say, the mean when not using the pd.crosstab function, you need to insert a summary row into the dataframe yourself.

First, create the summary row: the .mean(axis=0) method takes the mean down the rows, giving one value per column. We cast the result to a dataframe using pd.DataFrame, as otherwise it would be a Series. Next, we give its single column a better name than the default “0”. Because the index of our original dataframe has two levels (species and island), that name needs two levels as well. We only want “Summary:” to appear once, and we’ve arbitrarily put it in the first level, leaving the second empty. Multi-level indices and columns can get complicated, but the essential trick is that a list of strings is replaced by a list of tuples of strings: for example, ["Summary:", ...] becomes [("Summary:", ""), ...]. Putting this all together gives a dataframe whose index is the same as the columns of the original dataframe:

summary_row = pd.DataFrame(pen_sum_extra_col_info.mean(axis=0))
summary_row.columns = [("Summary:", "")]  # note our index has two levels
summary_row

		(Summary:, )
Lengths	Median bill length (mm)	42.600000
Lengths	Mean bill length (mm)	43.453872
Stds	Std. deviation of flipper length	6.632687

Next we need to transpose the new summary row (so that its columns align with those in our original data frame) using .T, and concatenate the two dataframes together:

pd.concat([pen_sum_extra_col_info, summary_row.T], axis=0)

		Lengths		Stds
		Median bill length (mm)	Mean bill length (mm)	Std. deviation of flipper length
Adelie	Biscoe	38.70	38.975000	6.729247
	Dream	38.55	38.501786	6.585083
	Torgersen	38.90	NaN	6.232238
Chinstrap	Dream	49.55	48.833824	7.131894
Gentoo	Biscoe	47.30	47.504878	6.484976
Summary:		42.60	43.453872	6.632687

Once this is done, you can apply all the usual stylings.
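The whole summary-row recipe can be sketched end to end on a small, made-up table. The values are illustrative, and naming the levels of the summary label to match the original index is an extra step chosen here, not something the walkthrough above does.

```python
import pandas as pd

# Made-up table with a two-level index, as in the penguins summary
idx = pd.MultiIndex.from_tuples(
    [("Adelie", "Biscoe"), ("Adelie", "Dream"), ("Gentoo", "Biscoe")],
    names=["species", "island"],
)
table = pd.DataFrame(
    {"median_bill": [38.7, 38.55, 47.3], "mean_bill": [38.975, 38.502, 47.505]},
    index=idx,
)

# Column means as a one-column dataframe indexed by the original column labels
summary_row = table.mean(axis=0).to_frame()

# Give that column a two-level label so it lines up with the two-level index
summary_row.columns = pd.MultiIndex.from_tuples(
    [("Summary:", "")], names=table.index.names
)

# Transpose so that the summary becomes a row, then stack it under the table
with_summary = pd.concat([table, summary_row.T], axis=0)
print(with_summary)
```

The same pattern works for other statistics: swap .mean(axis=0) for, say, .median(axis=0) or .sum(axis=0).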

Writing pandas tables to file#

Writing pandas tables to file is fairly straightforward: just use one of pandas’ many, many output functions. These typically begin with .to_ and then the output name. The most useful output formats will be:

to_html()
to_latex()
to_string()
to_markdown()

Add the filename you’d like to write to within the brackets following these method names. For example, to write a LaTeX table it would be:

from pathlib import Path
pen_styled.to_latex(Path("outputs/table_one.tex"))

These files can then be picked up by other documents. Note that, sometimes, when exporting to LaTeX, the code will have “escape” characters, for example extra backslashes. In some versions of pandas you can turn these off with an escape=False keyword argument.
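As a small sketch of the pattern, the example below writes the same made-up table to an HTML file and a plain-text file. It uses a temporary folder so that nothing is overwritten; the file names and the data are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up table used only for illustration
table = pd.DataFrame(
    {"size": [16988, 11506]},
    index=pd.Index(["Asia", "Africa"], name="landmass"),
)

# A temporary output folder; in a real project this might be "outputs/"
out_dir = Path(tempfile.mkdtemp())
html_path = out_dir / "table_one.html"
txt_path = out_dir / "table_one.txt"

# Passing a path writes the rendered table to that file
table.to_html(html_path)
table.to_string(txt_path)

# Called without a path, the same methods return the text instead
html_text = table.to_html()
print(html_text[:60])
```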

It’s not perfect, but if you’re writing a table to LaTeX and want footnotes and source notes, you can make use of the to_latex() method’s caption= keyword argument.

Limitations to pandas tables#

pandas tables have some limitations. It is not easy to include rich information such as images: while they can be included in the HTML rendition of a table on a webpage or in a Jupyter Notebook, it’s far harder to export this kind of information sensibly to an output file that isn’t HTML.

A trickier general issue with them is that it can be hard to include all of the relevant information you’d expect in a table: they work extremely well for a typical table that is just rows and columns with equal-sized cells, but if you want to include, say, a long row (of a single cell) at the end for either footnotes or source notes, there isn’t an obvious way to do it. Similarly, it’s hard to make parts of a table such as the title, sub-title, and stubhead label work well in all cases.

To create a footnote or source note row, you could insert an extra row like above, but it’s a very unsatisfactory work-around as your notes can only be in one of the columns (not spread across all of them) and it will lead to you losing the data types in at least one of your original columns.
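The loss of data types is easy to see in a small sketch. The table and the note text below are made up for illustration.

```python
import pandas as pd

# Made-up table with one text column and one integer column
table = pd.DataFrame({"name": ["Asia", "Africa"], "size": [16988, 11506]})

# A "source note" row has to put text in one column and a filler in the other
note = pd.DataFrame({"name": ["Source: a hypothetical almanac"], "size": [""]})

with_note = pd.concat([table, note], ignore_index=True)

# The numeric column is no longer numeric once the note row is added
print(table["size"].dtype)
print(with_note["size"].dtype)
print(pd.api.types.is_numeric_dtype(with_note["size"]))
```

Once the column stops being numeric, number formatting such as thousands separators or rounding can no longer be applied to it directly.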

However, there is another option for that…

great tables for tables#

Newcomer great tables has swept in and provided a compelling way to make really fancy tables right off the bat.

This section is indebted to the great tables documentation.

Let’s start with a basic example using some data on islands:

from great_tables.data import islands  # this is a pandas dataframe

islands.head()

	name	size
0	Africa	11506
1	Antarctica	5500
2	Asia	16988
3	Australia	2968
4	Axel Heiberg	16

Okay, now we’re going to construct a table. Importantly, unlike with pandas, we can add in a subtitle and as many source notes (notes that appear under the table) as we like.

from great_tables import GT, md

islands_mini = islands.sort_values(by="size", ascending=False).head(10)

islands_table = (
    GT(islands_mini)
    .tab_header(
        title="Large Landmasses of the World",
        subtitle="The top ten largest are presented",
    )
    .tab_stub(rowname_col="name")
    .tab_source_note(
        source_note="Source: The World Almanac and Book of Facts, 1975, page 406."
    )
    .tab_source_note(
        source_note=md(
            "Reference: McNeil, D. R. (1977) *Interactive Data Analysis*. Wiley."
        )
    )
    .tab_stubhead(label="landmass")
    .fmt_integer(columns="size")
)

islands_table

Large Landmasses of the World
The top ten largest are presented
landmass	size
Asia	16,988
Africa	11,506
North America	9,390
South America	6,795
Antarctica	5,500
Europe	3,745
Australia	2,968
Greenland	840
New Guinea	306
Borneo	280
Source: The World Almanac and Book of Facts, 1975, page 406.
Reference: McNeil, D. R. (1977) Interactive Data Analysis. Wiley.

The naming convention of the different parts of the table should be fairly intuitive but the diagram below will help you to navigate all of the possible options:

The components of a (great) table

Our first example did not make use of column spanners (labels that sit above a group of columns), but our second example, on air quality, does:

from great_tables import GT, html
from great_tables.data import airquality

airquality_m = airquality.head(10).assign(Year=1973)

gt_airquality = (
    GT(airquality_m)
    .tab_header(
        title="New York Air Quality Measurements",
        subtitle="Daily measurements in New York City (May 1-10, 1973)",
    )
    .tab_spanner(label="Time", columns=["Year", "Month", "Day"])
    .tab_spanner(label="Measurement", columns=["Ozone", "Solar_R", "Wind", "Temp"])
    .cols_move_to_start(columns=["Year", "Month", "Day"])
    .cols_label(
        Ozone=html("Ozone,<br>ppbV"),
        Solar_R=html("Solar R.,<br>cal/m<sup>2</sup>"),
        Wind=html("Wind,<br>mph"),
        Temp=html("Temp,<br>&deg;F"),
    )
)

gt_airquality

New York Air Quality Measurements
Daily measurements in New York City (May 1-10, 1973)
Time			Measurement
Year	Month	Day	Ozone, ppbV	Solar R., cal/m²	Wind, mph	Temp, °F
1973	5	1	41.0	190.0	7.4	67
1973	5	2	36.0	118.0	8.0	72
1973	5	3	12.0	149.0	12.6	74
1973	5	4	18.0	313.0	11.5	62
1973	5	5			14.3	56
1973	5	6	28.0		14.9	66
1973	5	7	23.0	299.0	8.6	65
1973	5	8	19.0	99.0	13.8	59
1973	5	9	8.0	19.0	20.1	61
1973	5	10		194.0	8.6	69

One of the great features of great tables is that we can add in “nanoplots” that provide further information on the data being presented but in a visual format. Here’s an example:

numbers_df = pd.DataFrame(
    {
        "example": ["Row " + str(x) for x in range(1, 5)],
        "numbers": [
            {"val": [20, 23, 6, 7, 37, 23, 21, 4, 7, 16]},
            {"val": [2.3, 6.8, 9.2, 2.42, 3.5, 12.1, 5.3, 3.6, 7.2, 3.74]},
            {"val": [-12, -5, 6, 3.7, 0, 8, -7.4]},
            {"val": [2, 0, 15, 7, 8, 10, 1, 24, 17, 13, 6]},
        ],
    }
)

GT(numbers_df).fmt_nanoplot(columns="numbers")

(Rendered output: a table with columns “example” and “numbers”, in which each of Row 1 to Row 4 shows a line nanoplot of its values.)

And one with the area between the min and the median as a reference:

GT(numbers_df).fmt_nanoplot(columns="numbers", reference_area=["min", "median"])

(Rendered output: the same table, with each line nanoplot now drawn over a reference area.)

A line chart isn’t the only type supported. For example there are bars too. And there are lots of formatting options. (Try hovering your mouse over the table below!)

from great_tables import nanoplot_options

(
    GT(numbers_df).fmt_nanoplot(
        columns="numbers",
        plot_type="bar",
        autoscale=True,
        reference_line="min",
        reference_area=[0, "max"],
        options=nanoplot_options(
            data_bar_stroke_color="gray",
            data_bar_stroke_width=2,
            data_bar_fill_color="orange",
            data_bar_negative_stroke_color="blue",
            data_bar_negative_stroke_width=1,
            data_bar_negative_fill_color="lightblue",
            reference_line_color="pink",
            reference_area_fill_color="bisque",
            vertical_guide_stroke_color="blue",
        ),
    )
)

(Rendered output: the same table, with each of Row 1 to Row 4 shown as a bar nanoplot.)

You can also colour cells as you like, or based on a condition (including the type of data). Here’s an example showing that off:

import polars as pl
from great_tables.data import sza

# this is just some data prep
# that gets data in right shape
sza_pivot = (
    pl.from_pandas(sza)
    .filter((pl.col("latitude") == "20") & (pl.col("tst") <= "1200"))
    .select(pl.col("*").exclude("latitude"))
    .drop_nulls()
    .pivot(values="sza", index="month", on="tst", sort_columns=True)
)
sza_pivot.head()

shape: (5, 15)

month	0530	0600	0630	0700	0730	0800	0830	0900	0930	1000	1030	1100	1130	1200
str	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64	f64
"jan"	null	null	null	84.9	78.7	72.7	66.1	61.5	56.5	52.1	48.3	45.5	43.6	43.0
"feb"	null	null	88.9	82.5	75.8	69.6	63.3	57.7	52.2	47.4	43.1	40.0	37.8	37.2
"mar"	null	null	85.7	78.8	72.0	65.2	58.6	52.3	46.2	40.5	35.5	31.4	28.6	27.7
"apr"	null	88.5	81.5	74.4	67.4	60.3	53.4	46.5	39.7	33.2	26.9	21.3	17.2	15.5
"may"	null	85.0	78.2	71.2	64.3	57.2	50.2	43.2	36.1	29.1	26.1	15.2	8.8	5.0

(
    GT(sza_pivot, rowname_col="month")
    .data_color(
        domain=[90, 0],
        palette=["rebeccapurple", "white", "orange"],
        na_color="white",
    )
    .tab_header(
        title="Solar Zenith Angles from 05:30 to 12:00",
        subtitle=html("Average monthly values at latitude of 20&deg;N."),
    )
)

Solar Zenith Angles from 05:30 to 12:00
Average monthly values at latitude of 20°N.
	0530	0600	0630	0700	0730	0800	0830	0900	0930	1000	1030	1100	1130	1200
jan	None	None	None	84.9	78.7	72.7	66.1	61.5	56.5	52.1	48.3	45.5	43.6	43.0
feb	None	None	88.9	82.5	75.8	69.6	63.3	57.7	52.2	47.4	43.1	40.0	37.8	37.2
mar	None	None	85.7	78.8	72.0	65.2	58.6	52.3	46.2	40.5	35.5	31.4	28.6	27.7
apr	None	88.5	81.5	74.4	67.4	60.3	53.4	46.5	39.7	33.2	26.9	21.3	17.2	15.5
may	None	85.0	78.2	71.2	64.3	57.2	50.2	43.2	36.1	29.1	26.1	15.2	8.8	5.0
jun	89.2	82.7	76.0	69.3	62.5	55.7	48.8	41.9	35.0	28.1	21.1	14.2	7.3	2.0
jul	88.8	82.3	75.7	69.1	62.3	55.5	48.7	41.8	35.0	28.1	21.2	14.3	7.7	3.1
aug	None	83.8	77.1	70.2	63.3	56.4	49.4	42.4	35.4	28.3	21.3	14.3	7.3	1.9
sep	None	87.2	80.2	73.2	66.1	59.1	52.1	45.1	38.1	31.3	24.7	18.6	13.7	11.6
oct	None	None	84.1	77.1	70.2	63.3	56.5	49.9	43.5	37.5	32.0	27.4	24.3	23.1
nov	None	None	87.8	81.3	74.5	68.3	61.8	56.0	50.2	45.3	40.7	37.4	35.1	34.4
dec	None	None	None	84.3	78.0	71.8	66.1	60.5	55.6	50.9	47.2	44.2	42.4	41.8

All of these tables have shown up as gloriously rendered HTML, which is an export option via GT.as_raw_html(). And you can easily save them as high quality (lossless) image files too via GT.save().

There is also a nascent LaTeX export option, which is critical for research work. While some features aren’t yet supported, if you need a vanilla table, it’s still very useful. For example, for the islands dataset we saw earlier, we have to ‘turn off’ the use of markdown in one of the source notes like so (using regular text instead):

simple_islands_table = (
    GT(islands_mini)
    .tab_header(
        title="Large Landmasses of the World",
        subtitle="The top ten largest are presented",
    )
    .tab_stub(rowname_col="name")
    .tab_source_note(
        source_note="Source: The World Almanac and Book of Facts, 1975, page 406."
    )
    .tab_source_note(
        source_note=(
            "Reference: McNeil, D. R. (1977) Interactive Data Analysis. Wiley."
        )
    )
    .tab_stubhead(label="landmass")
    .fmt_integer(columns="size")
)

simple_islands_table

Large Landmasses of the World
The top ten largest are presented
landmass	size
Asia	16,988
Africa	11,506
North America	9,390
South America	6,795
Antarctica	5,500
Europe	3,745
Australia	2,968
Greenland	840
New Guinea	306
Borneo	280
Source: The World Almanac and Book of Facts, 1975, page 406.
Reference: McNeil, D. R. (1977) Interactive Data Analysis. Wiley.

Then, to export a table that can be used in your main LaTeX report file via \input{table_out.tex}, it would be:

with open("table_out.tex", "w") as f:
    f.write(simple_islands_table.as_latex())