Data Visualisation with Seaborn#

Introduction#

Warning

seaborn’s object API is still a work-in-progress, so check the version you’re using carefully and note that the API may change relative to what’s shown here.

Here you’ll see how to make plots quickly using the declarative plotting package seaborn. This package is good if you want to make a standard chart from so-called tidy data, where you have one row per observation and one column per variable.
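If your data are not yet tidy, a reshaping step usually comes first. The following is a minimal sketch using pandas; the table, its country names, and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical "wide" table: one row per country, one column per year.
wide = pd.DataFrame(
    {
        "country": ["Atlantis", "Erewhon"],
        "2019": [71.2, 68.4],
        "2020": [71.5, 68.9],
    }
)

# Reshape to tidy form: one row per (country, year) observation,
# with one column for each variable.
tidy = wide.melt(id_vars="country", var_name="year", value_name="life_expectancy")
print(tidy)
```

The tidy version has three columns (country, year, life_expectancy), each of which could be assigned to a plot variable such as x, y, or color.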

Note

We recommend you use letsplot for declarative plotting but seaborn is an excellent alternative that builds on matplotlib and so is more customisable.

Because seaborn is built on top of matplotlib, you can also mix code from the two packages.

The rest of this chapter is indebted to the excellent seaborn object notation documentation.

As ever, we start by bringing in the packages we’ll need:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
import seaborn.objects as so


# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator

Quite a few of the examples we’ll see use a range of additional datasets, so let’s grab those straight away:

tips = sns.load_dataset("tips")
penguins = sns.load_dataset("penguins").dropna()
diamonds = sns.load_dataset("diamonds")
healthexp = sns.load_dataset("healthexp").sort_values(["Country", "Year"]).query("Year <= 2020")

Specifying a plot and mapping data#

The most important class in seaborn’s objects interface is Plot. You specify plots by instantiating a Plot() object and calling its methods. Let’s see a simple example:

(
    so.Plot(penguins, x="bill_length_mm", y="bill_depth_mm")
    .add(so.Dot())
)

This code, which produces a scatter plot, should look reasonably familiar. Just as when using seaborn.scatterplot(), we passed a tidy dataframe (penguins) and assigned two of its columns to the x and y coordinates of the plot. But instead of starting with the type of chart and then adding some data assignments, here we started with the data assignments and then added a graphical element.

Setting properties#

The Dot class is an example of a Mark: an object that graphically represents data values. Each mark will have a number of properties that can be set to change its appearance:

(
    so.Plot(penguins, x="bill_length_mm", y="bill_depth_mm")
    .add(so.Dot(color="g", pointsize=4))
)

Mapping properties#

As with seaborn’s functions, it is also possible to map data values to various graphical properties:

(
    so.Plot(
        penguins, x="bill_length_mm", y="bill_depth_mm",
        color="species", pointsize="body_mass_g",
    )
    .add(so.Dot())
)

While this basic functionality is not novel, an important difference from the function API is that properties are mapped using the same parameter names that would set them directly (instead of having hue vs. color, etc.). What matters is where the property is defined: passing a value when you initialize Dot will set it directly, whereas assigning a variable when you set up the Plot() will map the corresponding data.

Beyond this difference, the objects interface also allows a much wider range of mark properties to be mapped:

(
    so.Plot(
        penguins, x="bill_length_mm", y="bill_depth_mm",
        edgecolor="sex", edgewidth="body_mass_g",
    )
    .add(so.Dot(color=".8"))
)

Defining groups#

The Dot mark represents each data point independently, so the assignment of a variable to a property only has the effect of changing each dot’s appearance. For marks that group or connect observations, such as Line, it also determines the number of distinct graphical elements:

(
    so.Plot(healthexp, x="Year", y="Life_Expectancy", color="Country")
    .add(so.Line())
)

It is also possible to define a grouping without changing any visual properties, by using group:

(
    so.Plot(healthexp, x="Year", y="Life_Expectancy", group="Country")
    .add(so.Line())
)

Transforming data before plotting#

Statistical transformation#

As with many seaborn functions, the objects interface supports statistical transformations. These are performed by Stat objects, such as Agg():

(
    so.Plot(penguins, x="species", y="body_mass_g")
    .add(so.Bar(), so.Agg())
)

In the function interface, statistical transformations are possible with some visual representations (e.g. seaborn.barplot()) but not others (e.g. seaborn.scatterplot()). The objects interface more cleanly separates representation and transformation, allowing you to compose Mark and Stat objects:

(
    so.Plot(penguins, x="species", y="body_mass_g")
    .add(so.Dot(pointsize=10), so.Agg())
)

When forming groups by mapping properties, the Stat transformation is applied to each group separately:

(
    so.Plot(penguins, x="species", y="body_mass_g", color="sex")
    .add(so.Dot(pointsize=10), so.Agg())
)
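To build intuition for what the aggregation step produces, here is a rough pandas equivalent. It is a sketch that assumes Agg is left at its default of taking the mean, and it uses a tiny made-up dataframe rather than the real penguins data.

```python
import pandas as pd

# Small, made-up stand-in for the penguins data.
df = pd.DataFrame(
    {
        "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
        "sex": ["Male", "Female", "Male", "Female"],
        "body_mass_g": [4000.0, 3400.0, 5500.0, 4700.0],
    }
)

# One value per x position: comparable to Agg() with no extra groupings.
by_species = df.groupby("species")["body_mass_g"].mean()

# One value per x position *and* colour group: comparable to adding color="sex".
by_species_sex = df.groupby(["species", "sex"])["body_mass_g"].mean()

print(by_species)
print(by_species_sex)
```

Mapping an extra property such as color simply adds another grouping key, which is why the number of plotted marks goes up.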

Resolving overplotting#

Some seaborn functions also have mechanisms that automatically resolve overplotting, as when seaborn.barplot “dodges” bars once hue is assigned. The objects interface has less complex default behavior. Bars representing multiple groups will overlap by default:

(
    so.Plot(penguins, x="species", y="body_mass_g", color="sex")
    .add(so.Bar(), so.Agg())
)

Nevertheless, it is possible to compose the Bar mark with the Agg stat and a second transformation, implemented by Dodge:

(
    so.Plot(penguins, x="species", y="body_mass_g", color="sex")
    .add(so.Bar(), so.Agg(), so.Dodge())
)

The Dodge class is an example of a Move transformation, which is like a Stat but only adjusts x and y coordinates. The Move classes can be applied with any mark, and it’s not necessary to use a Stat first:

(
    so.Plot(penguins, x="species", y="body_mass_g", color="sex")
    .add(so.Dot(), so.Dodge())
)

It’s also possible to apply multiple Move operations in sequence:

(
    so.Plot(penguins, x="species", y="body_mass_g", color="sex")
    .add(so.Dot(), so.Dodge(), so.Jitter(.3))
)

Creating variables through transformation#

The Agg stat requires both x and y to already be defined, but variables can also be created through statistical transformation. For example, the Hist stat requires only one of x or y to be defined, and it will create the other by counting observations:

(
    so.Plot(penguins, x="species")
    .add(so.Bar(), so.Hist())
)

The Hist stat will also create new x values (by binning) when given numeric data:

(
    so.Plot(penguins, x="flipper_length_mm")
    .add(so.Bars(), so.Hist())
)

Notice how we used Bars, rather than Bar for the plot with the continuous x axis. These two marks are related, but Bars has different defaults and works better for continuous histograms. It also produces a different, more efficient matplotlib artist. You will find the pattern of singular/plural marks elsewhere. The plural version is typically optimized for cases with larger numbers of marks.
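Conceptually, binning numeric data means choosing bin edges and counting the observations that fall between them. The sketch below illustrates the idea with numpy; the values are made up, and the choice of three bins is ours for illustration (seaborn picks the bins for you unless you say otherwise).

```python
import numpy as np

# Made-up flipper lengths in millimetres.
values = np.array([172, 181, 186, 190, 195, 198, 210, 215, 221, 230])

# Split the range of the data into three equal-width bins and count.
counts, edges = np.histogram(values, bins=3)

print(counts)  # number of observations in each bin
print(edges)   # the four edges that delimit the three bins
```

The bin positions play the role of the new x values, and the counts play the role of the newly created y variable.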

Some transforms accept both x and y, but add interval data for each coordinate. This is particularly relevant for plotting error bars after aggregating:

(
    so.Plot(penguins, x="body_mass_g", y="species", color="sex")
    .add(so.Range(), so.Est(errorbar="sd"), so.Dodge())
    .add(so.Dot(), so.Agg(), so.Dodge())
)
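The interval data can be thought of as a lower and an upper bound attached to each estimate. As a rough sketch, and assuming that errorbar="sd" means the mean plus or minus one sample standard deviation, the same numbers could be computed with pandas on a small made-up dataframe:

```python
import pandas as pd

# Made-up body masses for two species.
df = pd.DataFrame(
    {
        "species": ["Adelie"] * 3 + ["Gentoo"] * 3,
        "body_mass_g": [3400.0, 3700.0, 4000.0, 4700.0, 5100.0, 5500.0],
    }
)

# Point estimate (mean) and spread (sample standard deviation) per group.
summary = df.groupby("species")["body_mass_g"].agg(["mean", "std"])

# The interval that a range mark would span under the assumption above.
summary["lower"] = summary["mean"] - summary["std"]
summary["upper"] = summary["mean"] + summary["std"]

print(summary)
```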

Orienting marks and transforms#

When aggregating, dodging, and drawing a bar, the x and y variables are treated differently. Each operation has the concept of an orientation. The Plot() tries to determine the orientation automatically based on the data types of the variables. For instance, if we flip the assignment of species and body_mass_g, we’ll get the same plot, but oriented horizontally:

(
    so.Plot(penguins, x="body_mass_g", y="species", color="sex")
    .add(so.Bar(), so.Agg(), so.Dodge())
)

Sometimes, the correct orientation is ambiguous, as when both the x and y variables are numeric. In these cases, you can be explicit by passing the orient parameter to Plot.add():

(
    so.Plot(tips, x="total_bill", y="size", color="time")
    .add(so.Bar(), so.Agg(), so.Dodge(), orient="y")
)

Building and displaying the plot#

Each example thus far has produced a single subplot with a single kind of mark on it. But Plot() does not limit you to this.

Adding multiple layers#

More complex single-subplot graphics can be created by calling Plot.add() repeatedly. Each time it is called, it defines a layer in the plot. For example, we may want to add a scatterplot (now using Dots) and then a regression fit:

(
    so.Plot(tips, x="total_bill", y="tip")
    .add(so.Dots())
    .add(so.Line(), so.PolyFit())
)

Variable mappings that are defined in the Plot() constructor will be used for all layers:

(
    so.Plot(tips, x="total_bill", y="tip", color="time")
    .add(so.Dots())
    .add(so.Line(), so.PolyFit())
)

Layer-specific mappings#

You can also define a mapping such that it is used only in a specific layer. This is accomplished by defining the mapping within the call to Plot.add() for the relevant layer:

(
    so.Plot(tips, x="total_bill", y="tip")
    .add(so.Dots(), color="time")
    .add(so.Line(color=".2"), so.PolyFit())
)

Alternatively, define the mapping for the entire plot, but remove it from a specific layer by setting the variable to None:

(
    so.Plot(tips, x="total_bill", y="tip", color="time")
    .add(so.Dots())
    .add(so.Line(color=".2"), so.PolyFit(), color=None)
)

To recap, there are three ways to specify the value of a mark property: (1) by mapping a variable in all layers, (2) by mapping a variable in a specific layer, and (3) by setting the property directly:

from io import StringIO
from IPython.display import SVG
C = sns.color_palette("deep")
f = mpl.figure.Figure(figsize=(7, 3))
ax = f.subplots()
fontsize = 18
ax.add_artist(mpl.patches.Rectangle((.13, .53), .45, .09, color=C[0], alpha=.3))
ax.add_artist(mpl.patches.Rectangle((.22, .43), .235, .09, color=C[1], alpha=.3))
ax.add_artist(mpl.patches.Rectangle((.49, .43), .26, .09, color=C[2], alpha=.3))
ax.text(.05, .55, "Plot(data, 'x', 'y', color='var1')", size=fontsize, color=".2")
ax.text(.05, .45, ".add(Dot(pointsize=10), marker='var2')", size=fontsize, color=".2")
annots = [
    ("Mapped\nin all layers", (.35, .65), (0, 45)),
    ("Set directly", (.35, .4), (0, -45)),
    ("Mapped\nin this layer", (.63, .4), (0, -45)),
]
for i, (text, xy, xytext) in enumerate(annots):
    ax.annotate(
        text, xy, xytext,
        textcoords="offset points", fontsize=14, ha="center", va="center",
        arrowprops=dict(arrowstyle="->", color=C[i]), color=C[i],
    )
ax.set_axis_off()
f.subplots_adjust(0, 0, 1, 1)
f.savefig(s:=StringIO(), format="svg")
SVG(s.getvalue())

Faceting and pairing subplots#

As with seaborn’s figure-level functions (seaborn.displot(), seaborn.catplot(), etc.), the Plot() interface can also produce figures with multiple “facets”, or subplots containing subsets of data. This is accomplished with the Plot.facet() method:

(
    so.Plot(penguins, x="flipper_length_mm")
    .facet("species")
    .add(so.Bars(), so.Hist())
)

Call Plot.facet() with the variables that should be used to define the columns and/or rows of the plot:

(
    so.Plot(penguins, x="flipper_length_mm")
    .facet(col="species", row="sex")
    .add(so.Bars(), so.Hist())
)

You can facet using a variable with a larger number of levels by “wrapping” across the other dimension:

(
    so.Plot(healthexp, x="Year", y="Life_Expectancy")
    .facet(col="Country", wrap=3)
    .add(so.Line())
)

All layers will be faceted unless you explicitly exclude them, which can be useful for providing additional context on each subplot:

(
    so.Plot(healthexp, x="Year", y="Life_Expectancy")
    .facet("Country", wrap=3)
    .add(so.Line(alpha=.3), group="Country", col=None)
    .add(so.Line(linewidth=3))
)

An alternative way to produce subplots is Plot.pair(). Like seaborn.PairGrid(), this draws all of the data on each subplot, using different variables for the x and/or y coordinates:

(
    so.Plot(penguins, y="body_mass_g", color="species")
    .pair(x=["bill_length_mm", "bill_depth_mm"])
    .add(so.Dots())
)

You can combine faceting and pairing so long as the operations add subplots on opposite dimensions:

(
    so.Plot(penguins, y="body_mass_g", color="species")
    .pair(x=["bill_length_mm", "bill_depth_mm"])
    .facet(row="sex")
    .add(so.Dots())
)

Integrating with matplotlib#

There may be cases where you want multiple subplots to appear in a figure with a more complex structure than what Plot.facet() or Plot.pair() can provide. The current solution is to delegate figure setup to matplotlib and to supply the matplotlib object that Plot() should use with the Plot.on() method. This object can be a matplotlib.axes.Axes, a matplotlib.figure.Figure, or a matplotlib.figure.SubFigure; the last of these is most useful for constructing bespoke subplot layouts:

f = mpl.figure.Figure(figsize=(8, 4))
sf1, sf2 = f.subfigures(1, 2)
(
    so.Plot(penguins, x="body_mass_g", y="flipper_length_mm")
    .add(so.Dots())
    .on(sf1)
    .plot()
)
(
    so.Plot(penguins, x="body_mass_g")
    .facet(row="sex")
    .add(so.Bars(), so.Hist())
    .on(sf2)
    .plot()
)

Building and displaying the plot#

An important thing to know is that Plot() methods clone the object they are called on and return that clone instead of updating the object in place. This means that you can define a common plot spec and then produce several variations on it.

So, take this basic specification:

p = so.Plot(healthexp, "Year", "Spending_USD", color="Country")

We could use it to draw a line plot:

p.add(so.Line())

Or perhaps a stacked area plot:

p.add(so.Area(), so.Stack())

The Plot methods are fully declarative. Calling them updates the plot spec, but it doesn’t actually do any plotting. One consequence of this is that methods can be called in any order, and many of them can be called multiple times.

When does the plot actually get rendered? Plot is optimized for use in notebook environments. The rendering is automatically triggered when the Plot gets displayed in the Jupyter REPL. That’s why we didn’t see anything in the example above, where we defined a Plot but assigned it to p rather than letting it return out to the REPL.

To see a plot in a notebook, either return it from the final line of a cell or call Jupyter’s built-in display function on the object. The notebook integration bypasses matplotlib.pyplot entirely, but you can use its figure-display machinery in other contexts by calling Plot.show().

You can also save the plot to a file (or buffer) by calling Plot.save.

Customising the appearance#

The new interface aims to support extensive customisation through Plot, reducing the need to switch gears and use matplotlib functionality directly. (But please be patient; not all of the features needed to achieve this goal have been implemented!)

Parameterising scales#

All of the data-dependent properties are controlled by the concept of a Scale and the Plot.scale() method. This method accepts several different types of arguments. One possibility, which is closest to the use of scales in matplotlib, is to pass the name of a function that transforms the coordinates:

(
    so.Plot(diamonds, x="carat", y="price")
    .add(so.Dots())
    .scale(y="log")
)

Plot.scale() can also control the mappings for semantic properties like color. You can directly pass it any argument that you would pass to the palette parameter in seaborn’s function interface:

(
    so.Plot(diamonds, x="carat", y="price", color="clarity")
    .add(so.Dots())
    .scale(color="flare")
)

Another option is to provide a tuple of (min, max) values, controlling the range that the scale should map into. This works both for numeric properties and for colors:

(
    so.Plot(diamonds, x="carat", y="price", color="clarity", pointsize="carat")
    .add(so.Dots())
    .scale(color=("#88c", "#555"), pointsize=(2, 10))
)

For additional control, you can pass a Scale object. There are several different types of Scale, each with appropriate parameters. For example, Continuous lets you define the input domain (norm), the output range (values), and the function that maps between them (trans), while Nominal allows you to specify an ordering:

(
    so.Plot(diamonds, x="carat", y="price", color="carat", marker="cut")
    .add(so.Dots())
    .scale(
        color=so.Continuous("crest", norm=(0, 3), trans="sqrt"),
        marker=so.Nominal(["o", "+", "x"], order=["Ideal", "Premium", "Good"]),
    )
)

Customising legends and ticks#

The Scale objects are also how you specify which values should appear as tick labels / in the legend, along with how they appear. For example, the Continuous.tick method lets you control the density or locations of the ticks, and the Continuous.label method lets you modify the format:

(
    so.Plot(diamonds, x="carat", y="price", color="carat")
    .add(so.Dots())
    .scale(
        x=so.Continuous().tick(every=0.5),
        y=so.Continuous().label(like="${x:.0f}"),
        color=so.Continuous().tick(at=[1, 2, 3, 4]),
    )
)
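The pattern passed to like= uses Python's format-string syntax, with the tick value available as a field named x. You can preview what a pattern will produce with plain Python before using it in a plot; the tick values below are made up.

```python
# The same pattern used for the y axis labels.
price_pattern = "${x:.0f}"

# Preview the labels for a few made-up tick values.
labels = [price_pattern.format(x=value) for value in (326.0, 2401.4, 18823.0)]
print(labels)  # ['$326', '$2401', '$18823']

# A variant with a thousands separator.
with_commas = "${x:,.0f}".format(x=18823.0)
print(with_commas)  # $18,823
```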

Customising limits, labels, and titles#

Plot() has a number of methods for simple customisation, including Plot.label(), Plot.limit(), and Plot.share():

(
    so.Plot(penguins, x="body_mass_g", y="species", color="island")
    .facet(col="sex")
    .add(so.Dot(), so.Jitter(.5))
    .share(x=False)
    .limit(y=(2.5, -.5))
    .label(
        x="Body mass (g)", y="",
        color=str.capitalize,
        title="{} penguins".format,
    )
)
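The label values need not be strings: a callable is applied to the default label to produce the final text. Here the default legend title is the variable name ("island"), and we assume the default facet titles are the levels of the faceting variable. A plain-Python sketch of what the two callables used here return:

```python
# color=str.capitalize: applied to the default legend title, the variable name.
legend_title = str.capitalize("island")
print(legend_title)  # Island

# title="{} penguins".format: applied to each default facet title.
make_title = "{} penguins".format
facet_titles = [make_title(level) for level in ("Female", "Male")]
print(facet_titles)  # ['Female penguins', 'Male penguins']
```

Any function that takes a string and returns a string can be used in the same way, for example a lambda that replaces underscores with spaces.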

Theme customisation#

Finally, Plot() supports data-independent theming through the Plot.theme() method. Currently, this method accepts a dictionary of matplotlib rc parameters. You can set them directly and/or pass a package of parameters from seaborn’s theming functions:

from seaborn import axes_style
so.Plot().theme({**axes_style("whitegrid"), "grid.linestyle": ":"})