Graphics for Communication

14. Graphics for Communication

14.1. Introduction

In this chapter, you’ll learn about using visualisation to communicate.

In Exploratory Data Analysis, you learned how to use plots as tools for exploration. When you make exploratory plots, you know—even before looking—which variables the plot will display. You made each plot for a purpose, quickly looked at it, and then moved on to the next plot. In the course of most analyses, you’ll produce tens or hundreds of plots, most of which are immediately thrown away.

Now that you understand your data, you need to communicate your understanding to others. Your audience will likely not share your background knowledge and will not be deeply invested in the data. To help others quickly build up a good mental model of the data, you will need to invest considerable effort in making your plots as self-explanatory as possible. In this chapter, you’ll learn some of the tools that lets-plot provides to make charts tell a story.

14.1.1. Prerequisites

As ever, there is a plethora of options (and packages) for data visualisation using code. We’re focusing on the declarative, “grammar of graphics” approach using lets-plot here, but advanced users looking for more complex graphics might wish to use an imperative library such as the excellent matplotlib. You should have lets-plot, pandas, and numpy installed. Once you have them installed, import them like so:

import numpy as np
import pandas as pd
from lets_plot import *

LetsPlot.setup_html()

14.2. Labels, titles, and other contextual information

The easiest place to start when turning an exploratory graphic into an expository graphic is with good labels. Let’s look at an example using the MPG (miles per gallon) data, which covers the fuel economy for 38 popular models of cars from 1999 to 2008.

# load the data
mpg = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/mpg.csv", index_col=0
)

We want to show how fuel efficiency on the highway changes with engine displacement, in litres. The most basic chart we can make with these variables is:

(ggplot(mpg, aes(x="displ", y="hwy")) + geom_point())

Now we’re going to add lots of extra useful information that will make the chart better. The purpose of a plot title is to summarize the main finding. Avoid titles that just describe what the plot is, e.g., “A scatterplot of engine displacement vs. fuel economy”.

We’re going to:

add a title that summarises the main finding you’d like the viewer to take away (as opposed to one just describing the obvious!)
add a subtitle that provides more info on the y-axis, and make the x-label more understandable
remove the y-axis label that is at an awkward viewing angle
add a caption with the source of the data

Putting this all in, we get:

(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(colour="class"))
    + geom_smooth(se=False, method="loess", size=1)
    + labs(
        title="Fuel efficiency generally decreases with engine size",
        subtitle="Highway fuel efficiency (miles per gallon)",
        caption="Source: fueleconomy.gov",
        y="",
        x="Engine displacement (litres)",
    )
)

This is much clearer. It’s easier to read, we know where the data come from, and we can see why we’re being shown it too.

But maybe we want a different message? You can flex depending on your needs, and some people prefer to have a rotated y-axis so that the subtitle can provide even more context:

(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(colour="class"))
    + geom_smooth(se=False, method="loess", size=1)
    + labs(
        x="Engine displacement (L)",
        y="Highway fuel economy (mpg)",
        colour="Car type",
        title="Fuel efficiency generally decreases with engine size",
        subtitle="Two seaters (sports cars) are an exception because of their light weight",
        caption="Source: fueleconomy.gov",
    )
)

14.2.1. Exercises

Create one plot on the fuel economy data with customized title, subtitle, caption, x, y, and color labels.
Recreate the following plot using the fuel economy data. Note that both the colours and shapes of points vary by type of drive train.

Take an exploratory graphic that you’ve created in the last month, and add informative titles to make it easier for others to understand.

14.3. Annotations

In addition to labelling major components of your plot, it’s often useful to label individual observations or groups of observations. The first tool you have at your disposal is geom_text(). geom_text() is similar to geom_point(), but it has an additional aesthetic: label. This makes it possible to add textual labels to your plots.

There are two possible sources of labels: ones that are part of the data, which we’ll add with geom_text(); and ones that we add directly and manually as annotations using geom_label().

In the first case, you might have a data frame that contains labels. In the following plot we compute the mean highway fuel economy (“hwy”) and the mean engine size (“displ”) for each drive type (“drv”) and save the result as a new data frame called label_info. We use these means as the points to label, but any aggregation that places the labels well on the chart would do.

mapping = {
    "4": "4-wheel drive",
    "f": "front-wheel drive",
    "r": "rear-wheel drive",
}
label_info = (
    mpg.groupby("drv")
    .agg({"hwy": "mean", "displ": "mean"})
    .reset_index()
    .assign(drive_type=lambda x: x["drv"].map(mapping))
    .round(2)
)
label_info

	drv	hwy	displ	drive_type
0	4	19.17	4.00	4-wheel drive
1	f	28.16	2.56	front-wheel drive
2	r	21.00	5.18	rear-wheel drive

Then, we use this new data frame to label the three groups directly on the plot, replacing the legend. Using the fontface and size arguments we can customise the look of the text labels so that they are larger than the rest of the text on the plot and bold. (theme(legend_position="none") turns all the legends off; we’ll talk about it more shortly.)

(
    ggplot(mpg, aes(x="displ", y="hwy", color="drv"))
    + geom_point(alpha=0.5)
    + geom_smooth(se=False, method="loess")
    + geom_text(
        aes(x="displ", y="hwy", label="drive_type"),
        data=label_info,
        fontface="bold",
        size=8,
        hjust="left",
        vjust="bottom",
    )
    + theme(legend_position="none")
)

Note the use of hjust (horizontal justification) and vjust (vertical justification) to control the alignment of the label.

The second of the two methods we’re looking at is geom_label(). This has two modes: in the first, it works like geom_text() but with a box around the text, like so:

potential_outliers = mpg.query("hwy > 40 | (hwy > 20 & displ > 5)")
(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(color="black")
    + geom_smooth(se=False, method="loess", color="black")
    + geom_point(
        data=potential_outliers,
        color="red",
    )
    + geom_label(
        aes(label="model"),
        data=potential_outliers,
        color="red",
        position=position_jitter(),
        fontface="bold",
        size=5,
        hjust="left",
        vjust="bottom",
    )
    + theme(legend_position="none")
)

The second mode is generally useful for adding one or a few annotations to a plot by passing the position and the text directly, like so:

import textwrap

# wrap the text so it is over multiple lines:
trend_text = textwrap.fill("Larger engine sizes tend to have lower fuel economy.", 30)
trend_text

'Larger engine sizes tend to\nhave lower fuel economy.'

(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point()
    + geom_label(x=3.5, y=38, label=trend_text, hjust="left", color="red")
    + geom_segment(x=2, y=40, xend=5, yend=25, arrow=arrow(type="closed"), color="red")
)

Annotation is a powerful tool for communicating main takeaways and interesting features of your visualisations. The only limit is your imagination (and your patience with positioning annotations to be aesthetically pleasing)!

Remember, in addition to geom_text() and geom_label(), you have many other geoms in lets-plot available to help annotate your plot. A couple of ideas:

Use geom_hline() and geom_vline() to add reference lines. We often make them thick (size=2) and grey (color="gray"), and draw them underneath the primary data layer. That makes them easy to see, without drawing attention away from the data.
Use geom_rect() to draw a rectangle around points of interest. The boundaries of the rectangle are defined by aesthetics xmin, xmax, ymin, ymax.
You already saw the use of geom_segment() with the arrow argument to draw attention to a point with an arrow. Use aesthetics x and y to define the starting location, and xend and yend to define the end location.

14.3.1. Exercises

Use geom_text() with infinite positions to place text at the four corners of the plot.
Use geom_label() to add a point geom in the middle of your last plot without having to create a data frame. Customise the shape, size, or colour of the point.
How do labels with geom_text() interact with faceting? How can you add a label to a single facet? How can you put a different label in each facet? (Hint: Think about the dataset that is being passed to geom_text().)
What arguments to geom_label() control the appearance of the background box?
What are the four arguments to arrow()? How do they work? Create a series of plots that demonstrate the most important options.

14.4. Scales

Another way you can make your plot better for communication is to adjust the scales. Scales control how the aesthetic mappings manifest visually.

14.4.1. Default scales

Normally, lets-plot automatically adds scales for you and you don’t need to worry about them. For example, when you type:

(
    ggplot(mpg, aes(x="displ", y="hwy")) +
    geom_point(aes(color="class"))
)

lets-plot is automatically doing this behind the scenes:

(
    ggplot(mpg, aes(x="displ", y="hwy")) +
    geom_point(aes(color="class")) +
    scale_x_continuous() +
    scale_y_continuous() +
    scale_color_discrete()
)

Note the naming scheme for scales: scale_ followed by the name of the aesthetic, then _, then the name of the scale. The default scales are named according to the type of variable they align with: continuous, discrete, datetime, or date. scale_x_continuous() puts the numeric values from displ on a continuous number line on the x-axis, scale_color_discrete() chooses a colour for each class of car, etc. There are lots of non-default scales which you’ll learn about below.

The default scales have been carefully chosen to do a good job for a wide range of inputs. Nevertheless, you might want to override the defaults for two reasons:

You might want to tweak some of the parameters of the default scale. This allows you to do things like change the breaks on the axes, or the key labels on the legend.
You might want to replace the scale altogether, and use a completely different algorithm. Often you can do better than the default because you know more about the data.

14.4.2. Axis ticks and legend keys

Collectively axes and legends get the somewhat confusing name guides in lets-plot. Axes are used for x and y aesthetics; legends are used for everything else.

There are two primary arguments that affect the appearance of the ticks on the axes and the keys on the legend: breaks and labels. Breaks controls the position of the ticks, or the values associated with the keys. If you like, the breaks are the ticks. Labels controls the text label associated with each tick/key. We might more accurately call these tick labels. The most common use of breaks is to override the default choice:

(
    ggplot(mpg, aes(x="displ", y="hwy", color="drv"))
    + geom_point()
    + scale_y_continuous(breaks=np.arange(15, 40, step=5))
)

You can use labels in the same way (i.e. pass in an array or list of strings the same length as breaks). To remove them altogether, you would have to use a theme, though, a topic we’ll return to later. You can also use breaks and labels to control the appearance of legends. For discrete scales for categorical variables, labels can be a list giving the desired label for each level, as in the example below, or a dictionary mapping the existing level names to the desired labels.

(
    ggplot(mpg, aes(x="displ", y="hwy", color="drv"))
    + geom_point()
    + scale_color_discrete(labels=["4-wheel", "front", "rear"])
)

To change the formatting of the tick labels, use the format= keyword argument. This is useful to render currencies, percentages, and so on—though it’s often easier for the reader to just see this symbol once in the axis label.

In the example below, we read in the diamonds dataset and then format the price axis with format="$.2s". Let’s break this down:

the dollar sign says put a dollar sign in front of every number
the .2 says use two significant digits
the s says use an SI (Système International) prefix, such as k for thousands

There are a wealth of alternative options for formatting—it’s best to use the helpful page on formatting in the documentation of lets-plot to find out more.

diamonds = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/diamonds.csv",
    index_col=0,
)
diamonds["cut"] = diamonds["cut"].astype(
    pd.CategoricalDtype(
        categories=["Fair", "Good", "Very Good", "Premium", "Ideal"], ordered=True
    )
)
diamonds["color"] = diamonds["color"].astype(
    pd.CategoricalDtype(categories=["D", "E", "F", "G", "H", "I", "J"], ordered=True)
)

(
    ggplot(diamonds, aes(x="cut", y="price"))
    + geom_boxplot()
    + coord_flip()
    + scale_y_continuous(format="$.2s", breaks=np.arange(0, 19000, step=6000))
)

Another use of breaks is when you have relatively few data points and want to highlight exactly where the observations occur. For example, take this plot that shows when each US president started and ended their term.

presidential = pd.read_csv(
    "https://vincentarelbundock.github.io/Rdatasets/csv/ggplot2/presidential.csv",
    index_col=0,
)
presidential = presidential.astype({"start": "datetime64[ns]", "end": "datetime64[ns]"})
presidential["id"] = 33 + presidential.index
presidential.head()

	name	start	end	party	id
rownames
1	Eisenhower	1953-01-20	1961-01-20	Republican	34
2	Kennedy	1961-01-20	1963-11-22	Democratic	35
3	Johnson	1963-11-22	1969-01-20	Democratic	36
4	Nixon	1969-01-20	1974-08-09	Republican	37
5	Ford	1974-08-09	1977-01-20	Republican	38

(
    ggplot(presidential, aes(x="start", y="id"))
    + geom_point()
    + geom_segment(aes(xend="end", yend="id"))
    + scale_x_datetime()
)

14.4.3. Legend layout

You will most often use breaks and labels to tweak the axes. While they both also work for legends, there are a few other techniques you are more likely to use.

To control the overall position of the legend, you need to use a theme() setting. We’ll come back to themes at the end of the chapter, but in brief, they control the non-data parts of the plot. The theme setting legend_position controls where the legend is drawn, and to demonstrate this we’ll use gggrid() to arrange all of the plots.

base = ggplot(mpg, aes(x="displ", y="hwy")) + geom_point(aes(color="class"))

p1 = base + theme(legend_position="right")  # the default
p2 = base + theme(legend_position="left")
p3 = base + theme(legend_position="top") + guides(color=guide_legend(nrow=3))
p4 = base + theme(legend_position="bottom") + guides(color=guide_legend(nrow=3))

gggrid([p1, p2, p3, p4], ncol=2)

If your plot is short and wide, place the legend at the top or bottom, and if it’s tall and narrow, place the legend at the left or right. You can also use legend_position = "none" to suppress the display of the legend altogether.

To control the display of individual legends, use guides() along with guide_legend() or guide_colorbar().

14.4.4. Replacing a scale

Instead of just tweaking the details a little, you can instead replace the scale altogether. There are two types of scales you’re most likely to want to switch out: continuous position scales and colour scales. Fortunately, the same principles apply to all the other aesthetics, so once you’ve mastered position and colour, you’ll be able to quickly pick up other scale replacements.

It’s very useful to plot transformations of your variable. For example, it’s easier to see the precise relationship between carat and price if we log transform them. One way to do this is to transform the data with apply() before it is passed to ggplot:

(
    ggplot(
        diamonds.apply({"carat": np.log10, "price": np.log10}),
        aes(x="carat", y="price"),
    )
    + geom_bin2d()
)

However, the disadvantage of this transformation is that the axes are now labelled with the transformed values, making it hard to interpret the plot. Instead of transforming the data, we can do the transformation with the scale. This is visually identical, except the axes are labelled on the original data scale.

(
    ggplot(diamonds, aes(x="carat", y="price"))
    + geom_bin2d()
    + scale_x_log10()
    + scale_y_log10()
)

Another scale that is frequently customised is colour. The default categorical scale picks colours that are evenly spaced around the colour wheel. Useful alternatives are the ColorBrewer scales, which have been hand tuned to work better for people with common types of colour blindness. The two plots below look similar, but there is enough difference in the shades of red and green that the dots in the second plot can be distinguished even by people with red-green colour blindness.

(ggplot(mpg, aes(x="displ", y="hwy")) + geom_point(aes(color="drv")))

(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(color="drv"))
    + scale_color_brewer(palette="Set1")
)

Don’t forget simpler techniques for improving accessibility. If there are just a few colors, you can add a redundant shape mapping. This will also help ensure your plot is interpretable in black and white.

The ColorBrewer scales are documented online at https://colorbrewer2.org/. The sequential (top) and diverging (bottom) palettes are particularly useful if your categorical values are ordered, or have a “middle”. This often arises if you’ve used pd.cut() to make a continuous variable into a categorical variable.

[Figures: the ColorBrewer palettes]

When you have a predefined mapping between values and colours, use scale_color_manual(). For example, if we map presidential party to colour, we want to use the standard mapping of red for Republicans and blue for Democrats. One approach for assigning these colors is using hex colour codes:

mini_presid = presidential.iloc[5:, :]

(
    ggplot(mini_presid, aes(x="start", y="id", color="party"))
    + geom_point(size=3)
    + geom_segment(aes(xend="end", yend="id"), size=1)
    + scale_x_datetime(breaks=mini_presid["start"], format="%Y")
    + scale_color_manual(values=["#00AEF3", "#E81B23"], name="party")
)

You can also use typical colour names such as “red” and “blue”.

For continuous colour, you can use the built-in scale_color_gradient() or scale_fill_gradient(). If you have a diverging scale, you can use scale_color_gradient2(). That allows you to give, for example, positive and negative values different colors. That’s sometimes also useful if you want to distinguish points above or below the mean.

Another option is to use the viridis, magma, inferno, and plasma color scales developed for the extremely powerful imperative Python plotting package matplotlib. The designers, Nathaniel Smith and Stéfan van der Walt, carefully tailored continuous color schemes that are perceptible to people with various forms of color blindness as well as perceptually uniform in both color and black and white. These scales are available as palettes in lets-plot. Here’s an example using the continuous plasma palette (we’ll generate some random data first):

prng = np.random.default_rng(1837)  # prng = pseudo-random number generator
df_rnd = pd.DataFrame(prng.standard_normal((1000, 2)), columns=["x", "y"])
(
    ggplot(df_rnd, aes(x="x", y="y"))
    + geom_bin2d()
    + coord_fixed()
    + scale_fill_viridis(option="plasma")
    + labs(title="Plasma, continuous")
)

14.4.5. Zooming

There are three ways to control the plot limits:

Adjusting what data are plotted.
Setting the limits in each scale.
Setting xlim and ylim in coord_cartesian().

We’ll demonstrate these options in a series of plots. The first plot shows the relationship between engine size and fuel efficiency, coloured by type of drive train. The second plot shows the same variables, but subsets the data that are plotted. Subsetting the data has affected the x and y scales as well as the smooth curve.

(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(color="drv"))
    + geom_smooth(method="loess")
)

mpg_condition = (
    (mpg["displ"] >= 5) & (mpg["displ"] <= 6) & (mpg["hwy"] >= 10) & (mpg["hwy"] <= 25)
)

(
    ggplot(mpg.loc[mpg_condition], aes(x="displ", y="hwy"))
    + geom_point(aes(color="drv"))
    + geom_smooth(method="loess")
)

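Let’s compare these to the two plots below, where the first plot sets the limits on individual scales and the second plot sets them in coord_cartesian(). We can see that reducing the limits on the scales is equivalent to subsetting the data: points outside the limits are dropped before the smooth is fitted. coord_cartesian(), by contrast, fits the smooth to all of the data and only then zooms in. Therefore, to zoom in on a region of the plot, it’s generally best to use coord_cartesian().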

(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(color="drv"))
    + geom_smooth(method="loess")
    + scale_x_continuous(limits=(5, 6))
    + scale_y_continuous(limits=(10, 25))
)

(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(color="drv"))
    + geom_smooth(method="loess")
    + coord_cartesian(xlim=(5, 6), ylim=(10, 25))
)
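To see why dropping data before fitting changes a fitted curve while zooming afterwards does not, it can help to look at the numbers directly. The sketch below uses made-up values and a straight-line fit from numpy as a stand-in for the loess smoother, so it illustrates the principle rather than what lets-plot computes.

```python
import numpy as np

# Made-up data: y falls steeply at first, then stays flat.
x = np.arange(1, 9, dtype=float)  # 1, 2, ..., 8
y = np.array([35, 30, 25, 20, 20, 20, 20, 20], dtype=float)

# Fit on all of the data, as happens when you zoom with coord_cartesian().
full_slope, _ = np.polyfit(x, y, deg=1)

# Fit only on x >= 5, as happens when the data are subset
# (or when limits on a scale drop the other points).
keep = x >= 5
subset_slope, _ = np.polyfit(x[keep], y[keep], deg=1)

print(round(full_slope, 2))  # about -2.02
print(abs(subset_slope) < 1e-8)  # True: the subset looks flat
```

The same region of the x-axis tells two different stories depending on whether the fit saw the points outside it.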

On the other hand, setting the limits on individual scales is generally more useful if you want to expand the limits, e.g., to match scales across different plots. For example, if we extract two classes of cars and plot them separately, it’s difficult to compare the plots because all three scales (the x-axis, the y-axis, and the colour aesthetic) have different ranges.

suv = mpg.loc[mpg["class"] == "suv"]
compact = mpg.loc[mpg["class"] == "compact"]
(ggplot(suv, aes(x="displ", y="hwy", color="drv")) + geom_point())

(ggplot(compact, aes(x="displ", y="hwy", color="drv")) + geom_point())

One way to overcome this problem is to share scales across multiple plots, training the scales with the limits of the full data.

# limits are given as [lower, upper]
x_scale = scale_x_continuous(limits=mpg["displ"].agg(["min", "max"]).tolist())
y_scale = scale_y_continuous(limits=mpg["hwy"].agg(["min", "max"]).tolist())
col_scale = scale_color_discrete(limits=mpg["drv"].unique().tolist())

(
    ggplot(suv, aes(x="displ", y="hwy", color="drv"))
    + geom_point()
    + x_scale
    + y_scale
    + col_scale
)

(
    ggplot(compact, aes(x="displ", y="hwy", color="drv"))
    + geom_point()
    + x_scale
    + y_scale
    + col_scale
)

In this particular case, you could have simply used faceting, but this technique is useful more generally, if for instance, you want to spread plots over multiple pages of a report.

14.4.6. Exercises

What is the first argument to every scale? How does it compare to labs()?
Change the display of the presidential terms by:

a. Combining the two variants that customize colors and x axis breaks.
b. Improving the display of the y axis.
c. Labelling each term with the name of the president.
d. Adding informative plot labels.
e. Placing breaks every 4 years (this is trickier than it seems!).

14.5. Themes

Finally, you can customise the non-data elements of your plot with a theme:

(
    ggplot(mpg, aes(x="displ", y="hwy"))
    + geom_point(aes(color="class"))
    + geom_smooth(se=False)
    + theme_grey()
)

lets-plot includes several built-in themes, which are listed in the lets-plot documentation. You can also create your own themes, if you are trying to match a particular corporate or journal style.

Here’s an example of changing multiple theme() settings:

(
    ggplot(mpg, aes(x="displ", color="drv"))
    + geom_density(size=2)
    + ggtitle("Density of drives")
    + theme(
        axis_line=element_line(size=4),
        axis_ticks_length=10,
        axis_title_y="blank",
        legend_position=[1, 1],
        legend_justification=[1, 1],
        panel_background=element_rect(color="black", fill="#eeeeee", size=2),
        panel_grid=element_line(color="black", size=1),
    )
)

14.5.1. Exercises

Make the axis labels of your plot blue and bolded.

14.6. Layout

So far we talked about how to create and modify a single plot. What if you have multiple plots you want to lay out in a certain way? You can do that. To place two plots next to each other, you can simply put them in a list and call gggrid() on the list. Note that you first need to create the plots and save them as objects (in the following example they’re called p1 and p2).

p1 = ggplot(mpg, aes(x="displ", y="hwy")) + geom_point() + labs(title="Plot 1")
p2 = ggplot(mpg, aes(x="drv", y="hwy")) + geom_boxplot() + labs(title="Plot 2")
gggrid([p1, p2])

14.7. Saving plots to file

There are lots of output options to choose from to save your file to. Remember that, for graphics, vector formats are generally better than raster formats. In practice, this means saving plots in svg or pdf formats over jpg or png file formats. The svg format works in a lot of contexts (including Microsoft Word) and is a good default. To choose between formats, just supply the file extension and the file type will change automatically, e.g. “chart.svg” for svg or “chart.png” for png (though note that raster formats often have extra options, like how many dots per inch to use).

Let’s try this out using the figure we made in the previous section, p1. path="." just drops the file in the current directory.

ggsave(p1, "chart.svg", path=".")

'/home/runner/work/python4DS/python4DS/chart.svg'

To double check this has worked, let’s use the terminal. We’ll try the command ls, which lists everything in the current directory, and grep '\.svg$' to pull out any files that end in .svg from what is returned by ls. The two commands are strung together by a |. (Note that the leading exclamation mark below just tells the software that builds this book to use the terminal.)

!ls | grep '\.svg$'

chart.svg
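If you would rather stay in Python, or your system has no ls or grep, a similar check can be sketched with the standard library. The helper name files_with_suffix is made up for this example.

```python
from pathlib import Path


def files_with_suffix(directory, suffix):
    """Return the sorted names of files in `directory` ending in `suffix`."""
    return sorted(
        path.name
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix == suffix
    )


# List any svg files in the current directory.
svg_files = files_with_suffix(".", ".svg")
print(svg_files)
```

If the save above worked, the printed list should include chart.svg.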

14.8. Summary

In this chapter you’ve learned about adding plot labels such as title, subtitle, caption as well as modifying default axis labels, using annotation to add informational text to your plot or to highlight specific data points, customising the axis scales, and changing the theme of your plot. You’ve also learned about combining multiple plots in a single graphic with gggrid() and saving plots to file.

While you’ve so far learned about how to make many different types of plots and how to customise them using a variety of techniques, we’ve barely scratched the surface of what you can create with lets-plot.

The best place to go for further information is the lets-plot documentation.