Intro to Data Visualisation#
Here you’ll see how to make plots that present data in an engaging and informative way.
There is a plethora of options (and packages) for data visualisation using code. First, though, a note about the different philosophies of data visualisation. There are broadly two categories of approach to using code to create data visualisations: imperative, where you build what you want, and declarative, where you say what you want. Choosing which to use involves a trade-off: imperative libraries offer you flexibility but at the cost of some verbosity; declarative libraries offer you a quick way to plot your data, but only if it’s in the right format to begin with, and customisation may be more difficult.
There are also different purposes of data visualisation. It can be useful to bear in mind the three broad categories of visualisation that are out there: exploratory, scientific, and narrative.
Python excels at exploratory and scientific visualisation. The tools for narrative visualisation are not as good as they could be for making common chart types efficiently, but the endless customisability of one particular Python package (matplotlib) means you can always get the effect you need (with some work).
The first of the three kinds of vis, exploratory visualisation, is the kind that you do when you’re looking at data and trying to understand it. Just plotting the data is a really good strategy for getting a feel for any issues there might be. This is perhaps most famously demonstrated by Anscombe’s quartet: four different datasets with the same mean, standard deviation, and correlation but very different data distributions.
(First let’s import the packages we’ll need:)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Set seed for random numbers
seed_for_prng = 78557
prng = np.random.default_rng(seed_for_prng)  # prng = pseudo-random number generator
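To make the point about Anscombe’s quartet concrete, here is a minimal sketch. The data values are the standard published quartet, typed in by hand rather than loaded from a package, and the variable names (`quartet`, `summary`) are just illustrative choices. It computes the summary statistics for each dataset and then plots the four side by side.

```python
import numpy as np
import matplotlib.pyplot as plt

# Anscombe's quartet, entered by hand (datasets I-III share the same x values)
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x_123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x_123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x_4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# The summary statistics are almost identical across the four datasets...
summary = {
    name: (np.mean(y), np.std(y, ddof=1), np.corrcoef(x, y)[0, 1])
    for name, (x, y) in quartet.items()
}
for name, (mean_y, std_y, corr) in summary.items():
    print(f"{name}: mean(y)={mean_y:.2f}, std(y)={std_y:.2f}, corr(x, y)={corr:.3f}")

# ...but plotting the data reveals four very different patterns
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True, figsize=(6, 5))
for ax, (name, (x, y)) in zip(axes.flat, quartet.items()):
    ax.scatter(x, y)
    ax.set_title(f"Dataset {name}")
    ax.set_xlabel("x")
    ax.set_ylabel("y")
fig.tight_layout()
```

If you run this as a script rather than in a notebook, add `plt.show()` at the end to display the figure.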
Exploratory visualisation is usually quick and dirty, and flexible too. Some exploratory data viz can be automated, and some of the packages we saw in the chapter on Exploratory Data Analysis can do this. For an EDA package that’s explicitly built with visualisation in mind, check out SweetViz. Beyond you and perhaps your co-authors/collaborators, not many other people should be seeing your exploratory visualisation.
The second kind, scientific visualisation, is the prime cut of your exploratory visualisation. It’s the kind of plot you might include in a more technical paper, the picture that says a thousand words. I often think of the first image of a black hole (Akiyama et al.) as a prime example of this. You can get away with having a high density of information in a scientific plot and, in short format journals, you may need to. The journal Physical Review Letters, which has an 8 page limit, has a classic of this genre in more or less every issue. Ensuring that important values can be accurately read from the plot is especially important in these kinds of charts. But they can also be the kind of plot that presents the killer results in a study; they might not be exciting to people who don’t look at charts for a living, but they should be exciting to, and just as importantly understandable by, your peers.
The third and final kind is narrative visualisation. This is the one that requires the most thought in the step where you go from the first view to the end product. It’s a visualisation that doesn’t just show a picture, but gives an insight. These are the kind of visualisations that you might see in the Financial Times, The Economist, or on the BBC News website. They come with aids that help the viewer focus on the aspects that the creator wanted them to (you can think of these aids or focuses as doing for visualisation what bold font does for text). They’re well worth using in your work, especially if you’re trying to communicate a particular narrative, and especially if the people you’re communicating with don’t have deep knowledge of the topic. You might use them in a paper that you hope will have a wide readership, in a blog post summarising your work, or in a report intended for a policymaker.
You can find more information on the topic in the Narrative Data Visualisation chapter.
Quick guide to data visualisation#
Addressing data visualisation, a huge topic in itself, is definitely out of scope for this book! But it’s worth discussing a few general pointers at the outset that will serve you very well if you follow them.
A picture may tell a thousand words, but you’ve got to be a bit careful about what those words are. The first question you should ask yourself when it comes to data visualisation is ‘what does this plot tell the viewer?’, ie what do you want people to take away from your chart. That nugget of information should be as apparent as possible from the plot. Then you want to ensure that people do take away what you meant them to; the viewer should be left in little doubt about what you are saying.
Another factor to bear in mind is that papers typically don’t have more than, say, ten plots in them–and frequently fewer than that. So each one must count and advance the narrative of your work somehow. (Easy to say, hard to do in practice.) As an example, if you have data that are normally distributed, and you want to show this, it’s probably not worth using up a plot on it. But if you had two distributions whose differences were important for the overall story you were telling, that might be something to highlight.
Then there are more practical matters: is this plot better done as a scatter plot or a line? Should I stack my bar chart or split out the contributions? Those questions address the type of plot you’re creating. For example, if you have observations that are independent from one another, with no auto-correlation along the x-axis, a scatter plot is more appropriate than a line chart. However, for time series, which tend to exhibit a high degree of auto-correlation, a line chart could be just the thing. As well as the overall type, for example scatter plot, you can think about adding more information through the use of colours, shapes, sizes, and so on. But my advice is always to be sparing with extra dimensions of information as it very quickly becomes difficult to read. In most cases, an x-axis, a y-axis, and, usually, one other dimension (eg colour) will be the best option.
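As a rough illustration of the scatter-versus-line choice, the sketch below plots two made-up datasets side by side: independent observations as a scatter plot, and a random walk (a simple stand-in for an auto-correlated time series) as a line. The data, seed, and variable names are all assumptions for the purpose of the example.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)  # arbitrary seed, chosen for this sketch

# Independent observations: there is no meaningful ordering along x,
# so joining the points with a line would suggest structure that isn't there
x = rng.uniform(0, 10, size=100)
y = 2 * x + rng.normal(scale=3, size=100)

# A time series: each value builds on the previous one (a random walk),
# so a line connecting consecutive points is meaningful
t = np.arange(200)
walk = np.cumsum(rng.normal(size=200))

fig, (ax_scatter, ax_line) = plt.subplots(ncols=2, figsize=(9, 3.5))
ax_scatter.scatter(x, y, alpha=0.7)
ax_scatter.set(title="Independent observations", xlabel="x", ylabel="y")
ax_line.plot(t, walk)
ax_line.set(title="Time series", xlabel="Time step", ylabel="Value")
fig.tight_layout()
```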
Once you’ve decided on the type of chart, you can then think about smaller details. Unfortunately, lack of labels is endemic in economics (“percent of what!?”, I cry at least three times a day). Always make what you’re plotting clear and, if it has units, express them (eg “Salary (2015 USD)”). Think carefully about the tick labels to use too; you’ll want something quite different for linear versus log plots. Titles can be helpful too, if the axis labels and the chart by themselves don’t tell the whole story.
Then, if there are very specific features you’d like to draw attention to, you can achieve this with text labels (as used liberally in the data visualisations you’ll see in newspapers like the Financial Times), greying out all but the most interesting data point, etc.; anything that singles out one part of the chart as the interesting one. A common trick is to plot the less important features with greater transparency and the important line/point/bar with solid colour. These all have the effect of drawing the eye straight to where it should spend the most time.
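The transparency trick might look something like the following sketch, which uses synthetic data. The series names, the choice of which series to highlight, and the colours are all assumptions made for illustration. The less important lines are drawn in semi-transparent grey, while the series of interest gets a solid colour, a thicker line, and a text label.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
years = np.arange(2000, 2021)

# Ten made-up series; "Series 3" is the one we pretend is of interest
series = {
    f"Series {i}": 100 + np.cumsum(rng.normal(size=years.size))
    for i in range(10)
}
highlight = "Series 3"

fig, ax = plt.subplots()
for name, values in series.items():
    if name == highlight:
        # Solid colour, thicker line, drawn on top of everything else
        ax.plot(years, values, color="tab:red", linewidth=2.5, zorder=3)
        # Label the line directly at its final point
        ax.annotate(
            name,
            xy=(years[-1], values[-1]),
            xytext=(5, 0),
            textcoords="offset points",
            color="tab:red",
            va="center",
        )
    else:
        # Background series: grey and mostly transparent
        ax.plot(years, values, color="grey", alpha=0.3)

ax.set_xlabel("Year")
ax.set_ylabel("Value (arbitrary units)")
ax.set_title("One highlighted series against a muted background")
```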
This is just the briefest of brief overviews of what you should bear in mind for good visualisation; I highly recommend the short and delightful Fundamentals of Data Visualization if you’d like to know more.
In terms of further resources to help you choose the right plot for the occasion, you can’t go far wrong with the Financial Times Visual Vocabulary of charts. And, please, please use vector graphics whenever you can!
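For matplotlib figures, exporting as vector graphics is mostly a matter of choosing a vector file format when saving. Here is a minimal sketch; the temporary directory and the file names are placeholders, so swap in your own paths.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([0, 1, 2, 3], [0, 1, 4, 9])
ax.set_xlabel("x")
ax.set_ylabel("x squared")

# Placeholder output location for this sketch
out_dir = Path(tempfile.mkdtemp())
svg_path = out_dir / "chart.svg"
pdf_path = out_dir / "chart.pdf"

# The file extension determines the format; .svg and .pdf are vector formats,
# so the chart stays sharp at any zoom level
fig.savefig(svg_path, bbox_inches="tight")
fig.savefig(pdf_path, bbox_inches="tight")
```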
This section has benefitted from this blog piece on visualisation and colour, and you can find more information there.
Colours often make a chart come alive, but, when encoding differences with colour, think carefully about what would serve your audience and message best. It’s best not to use colour randomly, but to choose colours that either add information to the chart or get out of the way of the message. Often, you’ll want to draw your colours from a ‘colour palette’, a collection of different colours that work together to create a particular effect. The best colour palettes take into account that colour blindness is a problem for many people, and they remain informative even when converted to greyscale.
For (unordered) categorical data, visually distinct colour palettes (also known as qualitative palettes) are better. The basic rule is that you should use distinct hues when your values don’t have an inherent order or range. Note that this does not include Likert scales (“strongly agree, agree, neutral, disagree, strongly disagree”), because even though there are distinct categories, there is an order to the possible responses.
Here are some examples of the qualitative hues available in the visualisation package matplotlib.
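One way to look at them yourself is sketched below. The selection of colourmap names is a small sample of matplotlib’s built-in qualitative maps, not an exhaustive list, and the layout choices are arbitrary.

```python
import matplotlib.pyplot as plt
from matplotlib.colors import to_hex

# A few of matplotlib's built-in qualitative colourmaps
names = ["tab10", "Set2", "Dark2", "Paired", "Pastel1", "Accent"]

fig, axes = plt.subplots(nrows=len(names), figsize=(6, 3.5))
for ax, name in zip(axes, names):
    cmap = plt.get_cmap(name)
    # Qualitative maps hold a fixed number of discrete colours (cmap.N);
    # draw one swatch per colour
    ax.imshow([list(range(cmap.N))], cmap=cmap, aspect="auto")
    ax.set_axis_off()
    ax.set_title(f"{name} ({cmap.N} colours)", loc="left", fontsize=9)
fig.tight_layout()

# Hex codes for one palette, handy for passing to other tools
tab10_hex = [to_hex(c) for c in plt.get_cmap("tab10").colors]
print(tab10_hex)
```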
Continuous Colour Scales#
Continuously varying data need a continuous colour scale, of which there are two types: sequential and diverging.
For data that vary from low to high, you can use a sequential colourmap. Best practice is to use a sequential colourmap that is perceptually uniform. The authors of the Python package matplotlib developed several perceptually uniform colourmaps that are now widely used, not just in Python, but in other languages and contexts too [Nuñez, Anderton, and Renslow, 2018]. These are the ones built-in to matplotlib:
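A quick way to view the perceptually uniform sequential colourmaps is to draw a smooth gradient through each one, as in this sketch (the figure size and layout are arbitrary choices):

```python
import numpy as np
import matplotlib.pyplot as plt

# matplotlib's perceptually uniform sequential colourmaps
names = ["viridis", "plasma", "inferno", "magma", "cividis"]

# A single row of values running smoothly from 0 to 1
gradient = np.linspace(0, 1, 256).reshape(1, -1)

fig, axes = plt.subplots(nrows=len(names), figsize=(6, 3))
for ax, name in zip(axes, names):
    ax.imshow(gradient, aspect="auto", cmap=name)
    ax.set_axis_off()
    ax.set_title(name, loc="left", fontsize=9)
fig.tight_layout()
```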
Do not use the JET colourmap: it is very much not perceptually uniform. If you do want a rainbow-like sequential colourmap, then turbo, developed by Google, is as good a choice as you’re going to find; it is much smoother than JET, though still not strictly perceptually uniform. You can find turbo within matplotlib.
Sometimes a diverging colourmap will be more appropriate for your data. These are also called bipolar or double-ended colour scales and, instead of just going from low to high, they tend to have a default middle value (often brighter) with values either side that are darker in different hues. Diverging colour scales are often used to visualise negative and positive differences relative to zero, election results, or Likert scales (for example, “strongly agree, agree, neutral, disagree, strongly disagree”).
These are the built-in ones in matplotlib:
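The built-in diverging colourmaps include, for example, RdBu, PiYG, BrBG, PuOr, and coolwarm. When using one to show differences relative to zero, it helps to make the colour limits symmetric so that the neutral middle colour lands exactly on zero. The sketch below does this with made-up data; the choice of RdBu and the label text are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)  # arbitrary seed
# Synthetic positive and negative differences (made-up data)
diffs = rng.normal(loc=0.5, scale=2.0, size=(8, 8))

# Symmetric limits keep the colourmap's neutral midpoint at exactly zero
limit = np.abs(diffs).max()

fig, ax = plt.subplots()
im = ax.imshow(diffs, cmap="RdBu", vmin=-limit, vmax=limit)
fig.colorbar(im, ax=ax, label="Difference from baseline (arbitrary units)")
ax.set_title("Diverging colourmap centred on zero")
```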
So how do you choose between a diverging or sequential colour scale? Diverging colour scales work better when i) there’s a meaningful middle point, ii) there are extremes that you want to emphasise, iii) differences are more of the story than levels, and iv) you don’t mind that people will have to put in a bit of extra effort to understand the chart relative to the typically more intuitive sequential colour scale.
Finally, this book uses a colour-blind friendly qualitative scheme (you can find the list of colours in this file).
Libraries for Data Visualisation#
In the rest of this chapter, we’ll take a look at making visualisations with several of these libraries. But first, let’s introduce them.
The most important and widely used data visualisation library in Python is matplotlib. It was used to make the first image of a black hole (Akiyama et al.) and to image the first empirical evidence of gravitational waves (Abbott et al.). matplotlib is an imperative visualisation library: you specify each part of what you want individually to build up an entire picture. It’s perhaps the easiest to get started with and the most difficult to master. As well as making plots, it can also be used to make diagrams, animations, and 3D visualisations (which you should use sparingly, if at all).
seaborn is a popular declarative library that builds on matplotlib and works especially well with data that are in a tidy format (one row per observation, one column per variable). This book recommends a mixture of seaborn and matplotlib for most needs, especially for graphs that are not interactive.
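To give a flavour of what “imperative” means in practice, here is a minimal sketch in which each element of the chart is added by its own explicit call. The data and the styling choices are illustrative only.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)

# Each step below adds or changes one specific part of the picture
fig, ax = plt.subplots()                               # empty figure and axes
ax.plot(x, np.sin(x), label="sin(x)")                  # first line
ax.plot(x, np.cos(x), linestyle="--", label="cos(x)")  # second line
ax.set_xlabel("x (radians)")                           # axis labels
ax.set_ylabel("Value")
ax.set_title("Building a chart step by step")          # title
ax.legend(frameon=False)                               # legend
for side in ("top", "right"):                          # remove two borders
    ax.spines[side].set_visible(False)
```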
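By way of contrast with the imperative style, a declarative call describes the mapping from columns to visual properties and leaves the details to the library. The sketch below assumes seaborn is installed and uses a small synthetic tidy dataset whose column names are invented for this example.

```python
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(7)  # arbitrary seed
n = 150

# Tidy data: one row per observation, one column per variable (synthetic)
df = pd.DataFrame(
    {
        "height_cm": rng.normal(170, 10, size=n),
        "group": rng.choice(["A", "B", "C"], size=n),
    }
)
df["weight_kg"] = 0.5 * df["height_cm"] - 20 + rng.normal(0, 5, size=n)

# Say what goes where: which column is x, which is y, which sets the colour
ax = sns.scatterplot(data=df, x="height_cm", y="weight_kg", hue="group")

# seaborn returns a matplotlib Axes, so matplotlib can handle the finishing touches
ax.set(xlabel="Height (cm)", ylabel="Weight (kg)")
```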
plotly express is another declarative-leaning library that’s suited to web apps and dashboards. This comes highly recommended if you need interactivity out of the box.
plotnine is another declarative plotting library. It adopts a grammar of graphics approach. What this means is that all visualisations begin with the same command, ggplot, and are combinations of layers that address different aspects of a plot, for example points or lines, scale, labels, and so on. It’ll be clearer when we come to an example.
pandas also has built-in plotting functions that you will have seen in the data analysis part of this book. They are of the form df.plot.*, where * could be, for example, scatter. These are convenience functions for making a quick plot of your data and they actually use matplotlib; we won’t see much of these here but you can find them in the data analysis chapter.
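As a brief sketch of the pandas route (with synthetic data and made-up column names), a scatter plot straight from a data frame looks like this:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)  # arbitrary seed
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 2 * df["x"] + rng.normal(size=50)

# The plotting accessor returns a matplotlib Axes,
# so the usual matplotlib methods are available for further tweaks
ax = df.plot.scatter(x="x", y="y")
ax.set_title("Quick look with pandas")
```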
We’re going to start this chapter by finding out a little bit more about each of these data visualisation libraries before looking at some examples of how to make specific plots with all the main libraries. We’ll end by looking at some more interesting and advanced plots.
Other Data Visualisation Tools#
There are tons of data visualisation libraries in Python, so many that most cannot be featured in great detail. Here’s a quick run down of a few more that may be worth looking into, depending on what you need:
proplot aims to be “A lightweight matplotlib wrapper for making beautiful, publication-quality graphics”, though the style is more similar to how people might make plots in the hard sciences rather than the social sciences. The point of this library is to take some of the verbosity out of matplotlib.
If you’re using very big data in machine learning models, it might be worth looking at Facebook’s hiplot.
Seaborn image does for image data what seaborn does for numerical and categorical data.
LIT (the Language Interpretability Tool) provides an open-source platform for visualization and understanding of NLP models (very impressive).
Wordcloud does exactly what you’d expect (but use word clouds very sparingly!)
VisPy for very large datasets plotted with WebGL and GPU acceleration.
PyQtGraph, a pure-Python graphics library for PyQt5/PySide2 and intended for use in (heavy) mathematics / scientific / engineering applications (not very user friendly).
bokeh offers interactive web plotting in Python.
HoloViews, a library designed to make data analysis and visualization seamless and simple with very concise commands (builds on bokeh and matplotlib).
YellowBrick for visualisations of machine learning models and metrics.
facets for quickly visualising machine learning features (aka regressors). Also useful for exploratory data analysis.
chartify, Spotify’s quick plotting library for data scientists.
scikit-plot offers plotting tools designed around Python’s wildly popular scikit-learn machine learning library.
themepy is an open source theme selector / creator and aesthetic manager for Matplotlib.
scienceplots provides scientific plotting styles–some associated with specific journals–for Matplotlib.
colour provides professional level colour tools for Python.
palettable has extra colour palettes that work well with Matplotlib.
colorcet is a collection of perceptually uniform colourmaps.
missingno for visualization of missing data.
bashplotlib, for when you want to make visualisations directly from the command line (I don’t imagine this will be very often, but always good to know the option is there!)
You can see an overview of all Python plotting tools at PyViz.
If you know:
✅ a little bit about how to use data visualisation; and
✅ what some of the most popular libraries for data vis are in Python
then you are well on your way to being a whizz with data vis!