Easy Data Visualisation for Tidy Data with Lets-Plot#

Introduction#

Here you’ll see how to make plots quickly using the declarative plotting package lets-plot. This package is perfect if you want to make a standard chart from so-called tidy data, where you have one row per observation and one column per variable. This chapter has benefitted from the book ggplot2: Elegant Graphics for Data Analysis.
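To make "tidy" concrete, here is a minimal sketch using pandas. The counts are made up purely for illustration: the wide table has one column per year, while the tidy version has one row per observation (an island in a given year) and one column per variable.

```python
import pandas as pd

# Made-up counts for illustration only: one column per year ("wide" data).
wide = pd.DataFrame(
    {
        "island": ["Biscoe", "Dream"],
        "2007": [44, 46],
        "2008": [64, 34],
    }
)

# Reshape so that each row is one observation (an island in a year)
# and each column is one variable.
tidy = wide.melt(id_vars="island", var_name="year", value_name="n_penguins")
print(tidy)
```

The tidy version is the shape that lets-plot expects, because each aesthetic can then be mapped to exactly one column.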

Note

lets-plot is the quickest way to get going with plots in Python.

Preliminaries#

To install lets-plot, run pip install lets-plot on the command line. We’re also going to be using the Palmer Penguins dataset, so you’ll need to run pip install palmerpenguins too.

There is some background information that you might find useful in getting to grips with lets-plot. All plots are composed of the data, the information you want to visualise, and a mapping: the description of how the data’s variables are mapped to aesthetic attributes. There are five mapping components:

  • A layer is a collection of geometric elements and statistical transformations. Geometric elements, geoms for short, represent what you actually see in the plot: points, lines, polygons, etc. Statistical transformations, stats for short, summarise the data: for example, binning and counting observations to create a histogram, or fitting a linear model.

  • Scales map values in the data space to values in the aesthetic space. This includes the use of colour, shape or size. Scales also draw the legend and axes.

  • A coord, or coordinate system, describes how data coordinates are mapped to the plane of the graphic. It also provides axes and gridlines to help read the graph. We normally use the Cartesian coordinate system, but a number of others are available, including polar coordinates and map projections.

  • A facet specifies how to break up and display subsets of data as small multiples.

  • A theme controls the finer points of display, like the font size and background colour. While the defaults have been chosen with care, you may need to consult other references to create an attractive plot.

As ever, we’re going to load the packages we’ll be using.

import pandas as pd
from palmerpenguins import load_penguins
from lets_plot import *

LetsPlot.setup_html()

You’ll notice the two main quirks of lets-plot already: the first is that we imported everything in the package using import *. This is to make it easier to use the package on-the-fly, because it has a lot of named functions. Second, we ran LetsPlot.setup_html(); this allows lets-plot charts to be displayed.

Getting started with lets-plot#

The goal of this section is to teach you how to produce useful graphics with lets-plot as quickly as possible. We’re going to cover:

  • The three key components of a lets-plot chart: data, aesthetics and geoms

  • How to add additional variables to a plot with aesthetics

  • How to display additional categorical variables in a plot using small multiples created by faceting

  • A variety of different geoms that you can use to create different types of plots

  • How to modify the axes

  • Things you can do with a plot object other than display it in your interactive window, like save it to disk

Data#

Let’s load our data into pandas.

penguins = load_penguins()
penguins.head()

   species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex  year
0   Adelie  Torgersen            39.1           18.7              181.0       3750.0    male  2007
1   Adelie  Torgersen            39.5           17.4              186.0       3800.0  female  2007
2   Adelie  Torgersen            40.3           18.0              195.0       3250.0  female  2007
3   Adelie  Torgersen             NaN            NaN                NaN          NaN     NaN  2007
4   Adelie  Torgersen            36.7           19.3              193.0       3450.0  female  2007

Basics#

Every plot has three key components: data, aesthetic mappings, and at least one layer (usually created with a geom function). Here’s a simple example:

(
    ggplot(penguins, aes(x="body_mass_g", y="flipper_length_mm")) +
    geom_point()
)

This produces a scatterplot defined by those three elements: the data are from the penguins dataframe, the aesthetic mapping puts body mass in grams on the x-axis and flipper length in mm on the y-axis, and the layer is a geom that draws points.

Note that the data and aesthetic mappings were supplied to a function called ggplot, and the layers (geoms) were then added on with +. The pattern is similar for all lets-plot charts. Here we left the geom at its default settings, but we could have modified it: for example, geom_point(size=5) gives slightly larger points.

Note that x and y are the first two positional arguments of aes, so you can omit x= and y= like this:

(
    ggplot(penguins, aes("body_mass_g", "flipper_length_mm")) +
    geom_point()
)

Adding extra dimensions: shape, colour, and size#

Although you should always be careful not to put too much information on a chart, you can add further dimensions to these plots. Let’s demonstrate this by adding colour to the mix:

(
    ggplot(penguins, aes("body_mass_g", "flipper_length_mm", colour="island")) +
    geom_point()
)

You can see that this has rendered the categorical variable “island” by having it appear in different colours. A legend has automatically been added. Do remember that not everyone can see all colours well, so it’s best to use colourblind-friendly colour scales whenever possible.

Note

Be careful with data types when adding extra dimensions to charts: if your data type is float or int instead of categorical, you will get a continuous colour gradient instead of a discrete colour scale.

Let’s look at shape too:

(
    ggplot(penguins, aes("body_mass_g", "flipper_length_mm", shape="island")) +
    geom_point()
)

As well as setting one size for all points (for example, with geom_point(size=5)), we can use size as an aesthetic too:

(
    ggplot(penguins, aes("body_mass_g", "flipper_length_mm", size="island")) +
    geom_point(alpha=0.5)
)

In the above, we used alpha=0.5, which is a transparency setting, to make it easier to see overlapping points.

And just as the size of the points can be set to a single, universal value, we can do the same for shape and colour: we just need to set them in geom_point().

Facets#

You can use facets (aka small multiples) to display more dimensions of information too. To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() tells the function what variable to have in successive charts. The variable that you pass to facet_wrap() should be categorical.

(
    ggplot(penguins, aes("body_mass_g", "flipper_length_mm"))
    + geom_point()
    + facet_wrap(facets="island", ncol=3)
)

Plot Geoms#

By substituting a different geom function for geom_point(), you’ll get a different type of plot. You’re now going to see some of the other important geoms provided in lets-plot.

  • geom_smooth() fits a smoothed conditional line then plots it and its standard error.

  • geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of points.

  • geom_histogram() and geom_density() show the distribution of continuous variables.

  • geom_bar() shows counts of categorical variables.

  • geom_path() and geom_line() draw lines between the data points. A line plot is constrained to produce lines that travel from left to right, while paths can go in any direction. Lines are typically used to explore how things change over time.

Let’s take a closer look at some of these:

Fitting a line#

(
    ggplot(penguins, aes("body_mass_g", "flipper_length_mm")) +
    geom_point() +
    geom_smooth(method="loess")
)

You can use a linear model instead with method="lm" (this is the default).

Jittered points and boxplots#

These are especially useful when we have lots of data that overlap, or want to get more of an idea of the overall distribution, or both.

(
    ggplot(penguins, aes("island", "body_mass_g"))
    + geom_jitter()
)

Box plots are created via:

(
    ggplot(penguins, aes("island", "body_mass_g"))
    + geom_boxplot()
)

Histograms and probability density plots#

You’re probably getting a good idea of how this works now! Here are the geoms for histograms and probability density plots.

(
    ggplot(penguins, aes("body_mass_g"))
    + geom_histogram()
)

geom_histogram() has a bins= keyword argument.

(
    ggplot(penguins, aes("body_mass_g"))
    + geom_density()
)

Remember, as ever, you can use help(FUNCTIONNAME) to get help on the options and keyword arguments for any function.

Bar Charts#

These are as you’d expect, but if you don’t want a count of the number of items but just to display the given values, you can use the keyword argument stat="identity".

(
    ggplot(penguins, aes("species"))
    + geom_bar()
)

Line charts and time series#

Let’s grab some data with a time dimension from FRED: vacancies and unemployment percent in the USA.

import pandas_datareader.data as web
import datetime

start = datetime.datetime(2000, 1, 1)
end = datetime.datetime(2021, 1, 1)
code_dict = {
    "Vacancies": "LMJVTTUVUSA647N",
    "Unemployment": "UNRATE",
    "LabourForce": "CLF16OV",
}
list_dfs = [
    web.DataReader(value, "fred", start, end)
    .rename(columns={value: key})
    .groupby(pd.Grouper(freq="YS"))  # "YS" = year start; "AS" is a deprecated alias
    .mean()
    for key, value in code_dict.items()
]
vu_data = pd.concat(list_dfs, axis=1)
vu_data = vu_data.assign(Vacancies=100 * vu_data["Vacancies"] / (vu_data["LabourForce"] * 1e3)).dropna()
vu_data["Date"] = vu_data.index
vu_data["Year"] = vu_data.index.year
vu_data.head()

            Vacancies  Unemployment    LabourForce        Date  Year
DATE
2001-01-01   3.028239      4.741667  143768.916667  2001-01-01  2001
2002-01-01   2.387254      5.783333  144856.083333  2002-01-01  2002
2003-01-01   2.212237      5.991667  146499.500000  2003-01-01  2003
2004-01-01   2.470209      5.541667  147379.583333  2004-01-01  2004
2005-01-01   2.753325      5.083333  149289.166667  2005-01-01  2005

(
    ggplot(vu_data, aes("Date", "Vacancies")) +
    geom_line(size=2)
)

We can make this even more interesting by looking at how two variables have co-moved in time together with a connected scatter plot.

(
    ggplot(vu_data, aes("Unemployment", "Vacancies")) +
    geom_path(size=1) +
    geom_point(size=5)
)

Labels and Titles#

xlab() and ylab() modify the x- and y-axis labels:

(
    ggplot(penguins, aes("body_mass_g", "flipper_length_mm")) +
    geom_point() +
    xlab("Body mass (g)") +
    ylab("Flipper length (mm)")
)

But you can also specify all labels and titles at once like so:

(
    ggplot(penguins, aes(x="flipper_length_mm", y="body_mass_g"))
    + geom_point(aes(color="species", shape="species"))
    + geom_smooth(method="lm")
    + labs(
        title="Body mass and flipper length",
        subtitle="Dimensions for Adelie, Chinstrap, and Gentoo Penguins",
        x="Flipper length (mm)",
        y="Body mass (g)",
        color="Species",
        shape="Species",
    )
)

Adding text annotations#

Should you wish to add text annotations to your plots, you can!

(
    ggplot(vu_data, aes("Unemployment", "Vacancies")) +
    geom_path(size=1, color="gray") +
    geom_point(color="gray", size=5) +
    geom_text(aes(label='Year'), position=position_nudge(y=0.3))
)

Limits on axes#

You always have two options when it comes to removing points from a plot: you can filter your dataframe before plotting, or you can change the limits on your axes. To do the latter, use the xlim and ylim commands.

(
    ggplot(penguins, aes(x="flipper_length_mm", y="body_mass_g")) +
    geom_point(size=4) +
    xlim(200, 230) +
    ylim(3e3, 5e3)
)
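The other route, filtering the dataframe before plotting, might look like the sketch below. It uses made-up rows and the same ranges as the limits above; Series.between() is inclusive at both ends by default, and rows with missing values drop out because they are not inside the range.

```python
import pandas as pd

# Made-up rows standing in for the penguins data, including one missing row.
toy = pd.DataFrame(
    {
        "flipper_length_mm": [181.0, 205.0, 215.0, 231.0, None],
        "body_mass_g": [3750.0, 4500.0, 5200.0, 5600.0, None],
    }
)

# Keep only rows that fall inside both ranges.
in_range = toy[
    toy["flipper_length_mm"].between(200, 230)
    & toy["body_mass_g"].between(3000, 5000)
]
print(in_range)
```

You could then pass in_range to ggplot() in place of the full dataframe.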

Other useful-to-know elements of lets-plot charts#

We won’t go into every detail of lets-plot here, as its documentation is comprehensive and you can find whatever you need there. But it may be useful to at least know of some further features we didn’t look at here, such as:

  • changing the theme and look of a plot

  • changing the scales (eg the axis ticks)

  • maps and geospatial charts

  • sampling

  • contour and other plots that show three dimensions via a level set, \( z = f(x, y) \).

Saving your plots to file#

Once you’ve made a plot, you might want to save it as an image that you can use elsewhere. That’s the job of ggsave(), which saves the plot you pass to it to disk:

plotted_data = (
    ggplot(penguins, aes(x="flipper_length_mm", y="body_mass_g")) + geom_point()
)
ggsave(plotted_data, filename="penguin-plot.svg")
'/Users/aet/Documents/git_projects/coding-for-economists/lets-plot-images/penguin-plot.svg'

This saved the figure to disk at the location shown—by default it’s in a subdirectory called “lets-plot-images”.

We used the file format “svg”. There are several output formats to choose from. Remember that, for graphics, vector formats are generally better than raster formats. In practice, this means preferring svg or pdf over jpg or png. The svg format works in a lot of contexts (including Microsoft Word) and is a good default. To choose between formats, just supply the file extension and the file type will change automatically, eg “chart.svg” for svg or “chart.png” for png. You can also save figures in HTML format.

If you’re using a raster format, you can control how big the figure is via the scale keyword argument.