6. Tidy Data#

6.1. Introduction#

In this chapter, you will learn a consistent way to organise your data in Python using the principle known as tidy data. Tidy data is not appropriate for everything, but for a lot of analysis and a lot of tabular data it will be what you need. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.

In this chapter, you’ll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we’ll dive into the main tool you’ll use for tidying data: melting. Melting allows you to change the form of your data without changing any of the values. We’ll finish up with a discussion of usefully untidy data, and how you can create it if needed.

If you particularly enjoy this chapter and want to learn more about the underlying theory, you can learn more in the Tidy Data paper published in the Journal of Statistical Software.

6.1.1. Prerequisites#

This chapter will use the pandas data analysis package.
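You can import pandas in the usual way; if it isn’t already installed, `pip install pandas` is the typical route (your environment may differ):

```python
# Import pandas under its conventional alias
import pandas as pd

# Check the installed version (any recent version should work for this chapter)
print(pd.__version__)
```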

6.2. Tidy Data#

There are three interrelated features that make a dataset tidy:

  1. Each variable is a column; each column is a variable.

  2. Each observation is a row; each row is an observation.

  3. Each value is a cell; each cell is a single value.

The figure below shows this:

Why ensure that your data is tidy? There are two main advantages:

  1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, it’s easier to learn the tools that work with it because they have an underlying uniformity. Some tools, for example the data visualisation package seaborn, are designed with tidy data in mind.

  2. There’s a specific advantage to placing variables in columns because it allows you to take advantage of pandas’ vectorised operations (operations applied to a whole column at once, which are much more efficient than looping over rows).

Tidy data aren’t going to be appropriate every time and in every case, but they’re a really, really good default for tabular data. Once you use it as your default, it’s easier to think about how to perform subsequent operations.

Tidy data are great, but one of pandas’ advantages relative to other data analysis libraries is that it isn’t too tied to tidy data and can happily navigate awkward non-tidy data manipulation tasks too.
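As a small illustration of the second advantage, here is a sketch (with invented numbers) of how a variable stored in its own column can be transformed in a single vectorised operation:

```python
import pandas as pd

# A tidy dataset: each variable (height, weight) is a column
df = pd.DataFrame(
    {
        "person": ["John", "Mary"],
        "height": [5.5, 6.0],  # feet
        "weight": [130, 150],  # pounds
    }
)

# Because height is a single column, converting feet to metres is one
# vectorised operation rather than a loop over rows
df["height_m"] = df["height"] * 0.3048
print(df)
```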

6.3. Make Data Tidy with pandas#

6.3.1. Melt#

melt() can help you go from untidy to tidy data (from wide data to long data), and is a really good one to remember.

Here’s an example of it in action:

import pandas as pd

df = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "job": ["Nurse", "Economist"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)
print("\n Unmelted: ")
print(df)
print("\n Melted: ")
df.melt(id_vars=["first", "last"], var_name="quantity", value_vars=["height", "weight"])
 Unmelted: 
  first last        job  height  weight
0  John  Doe      Nurse     5.5     130
1  Mary   Bo  Economist     6.0     150

 Melted: 
  first last quantity  value
0  John  Doe   height    5.5
1  Mary   Bo   height    6.0
2  John  Doe   weight  130.0
3  Mary   Bo   weight  150.0

Exercise

Perform a melt() that uses job as the id instead of first and last.

6.3.2. Wide to Long#

If you don’t want the headscratching of melt(), there’s also wide_to_long(), which is really useful for typical data cleaning cases where you have data like this:

import numpy as np

df = pd.DataFrame(
    {
        "A1970": {0: "a", 1: "b", 2: "c"},
        "A1980": {0: "d", 1: "e", 2: "f"},
        "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
        "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
        "X": dict(zip(range(3), np.random.randn(3))),
        "id": dict(zip(range(3), range(3))),
    }
)
df
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2 -1.295376   0
1     b     e    1.2    1.3  0.972266   1
2     c     f    0.7    0.1  1.059995   2

i.e. data where different variables and time periods are spread across the columns. wide_to_long() lets us specify the stubnames (‘A’, ‘B’), the name of the variable that varies across the columns (here, a year), and an id column; any other columns (X here) are carried along as-is.

pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year")
                X  A    B
id year                  
0  1970 -1.295376  a  2.5
1  1970  0.972266  b  1.2
2  1970  1.059995  c  0.7
0  1980 -1.295376  d  3.2
1  1980  0.972266  e  1.3
2  1980  1.059995  f  0.1
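wide_to_long() also has sep and suffix arguments for when column names aren’t simply the stubname followed by a number. As a sketch, here’s a hypothetical variant of the data above where an underscore separates the stub from the year:

```python
import pandas as pd

# Hypothetical variant where columns are named like "A_1970" rather than "A1970"
df = pd.DataFrame(
    {
        "A_1970": ["a", "b"],
        "A_1980": ["d", "e"],
        "B_1970": [2.5, 1.2],
        "B_1980": [3.2, 1.3],
        "id": [0, 1],
    }
)

# sep tells pandas what separates the stub from the suffix
long_df = pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year", sep="_")
print(long_df)
```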

6.3.3. Stack and Unstack#

stack() is a shortcut for taking a single type of wide data variable from columns and turning it into a long form dataset, but with an extra index level.

unstack() unsurprisingly performs the same operation, but in reverse.

Let’s define a multi-index data frame to demonstrate this:

tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df
                     A         B
first second                    
bar   one    -0.745456  0.168259
      two     0.313888  0.289276
baz   one    -0.124486  0.353456
      two    -0.790333  0.274120
foo   one    -0.236104  1.761171
      two     1.220715  0.600996
qux   one    -0.374970 -1.243281
      two    -0.524140  0.456356

Let’s stack this to create a tidy dataset:

df = df.stack()
df
first  second   
bar    one     A   -0.745456
               B    0.168259
       two     A    0.313888
               B    0.289276
baz    one     A   -0.124486
               B    0.353456
       two     A   -0.790333
               B    0.274120
foo    one     A   -0.236104
               B    1.761171
       two     A    1.220715
               B    0.600996
qux    one     A   -0.374970
               B   -1.243281
       two     A   -0.524140
               B    0.456356
dtype: float64

This has automatically created a multi-layered index, but that can be reverted to a numbered index using df.reset_index().
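As a sketch of that flattening step, using a smaller version of the frame above (with invented numbers, and column names we choose ourselves rather than pandas defaults):

```python
import pandas as pd

# Rebuild a small version of the multi-index frame from above
index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two")], names=["first", "second"]
)
df = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], index=index, columns=["A", "B"])

# Stack, then flatten the multi-layered index back into ordinary columns;
# "variable" and "value" are names chosen here for illustration
tidy = df.stack().reset_index()
tidy.columns = ["first", "second", "variable", "value"]
print(tidy)
```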

Now let’s see unstack but, instead of unstacking the ‘A’, ‘B’ variables we began with, let’s unstack the ‘first’ index level by passing level=0 (the default is to unstack the innermost level). This diagram shows what’s going on:

And here’s the code:

df.unstack(level=0)
first          bar       baz       foo       qux
second                                          
one    A -0.745456 -0.124486 -0.236104 -0.374970
       B  0.168259  0.353456  1.761171 -1.243281
two    A  0.313888 -0.790333  1.220715 -0.524140
       B  0.289276  0.274120  0.600996  0.456356

Exercise

What happens if you unstack to level=1 instead? What about applying unstack() twice?

6.3.4. Pivoting data from tidy to, err, untidy data#

At the start of this chapter, we said you should use tidy data (one row per observation, one column per variable) whenever you can. But there are times when you will want to take your lovingly prepared tidy data and pivot it into a wider format. pivot() and pivot_table() help you to do that.
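The difference between the two is that pivot_table() can aggregate when there are duplicate index/column combinations, whereas pivot() raises an error in that case. A minimal sketch (with invented numbers):

```python
import pandas as pd

# Tidy data with a repeated ("2000-01", "A") combination
df = pd.DataFrame(
    {
        "date": ["2000-01", "2000-01", "2000-02", "2000-02"],
        "variable": ["A", "A", "A", "B"],
        "value": [1.0, 3.0, 5.0, 7.0],
    }
)

# pivot() would fail on the duplicated entries, but pivot_table()
# aggregates them (taking the mean by default)
wide = df.pivot_table(index="date", columns="variable", values="value")
print(wide)
```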

This can be especially useful for time series data, where operations like shift() or diff() are typically applied assuming that an entry in one row follows (in time) from the one above. Here’s an example:

import numpy as np

data = {
    "value": np.random.randn(20),
    "variable": ["A"] * 10 + ["B"] * 10,
    "category": np.random.choice(["type1", "type2", "type3", "type4"], 20),
    "date": (
        list(pd.date_range("1/1/2000", periods=10, freq="M"))
        + list(pd.date_range("1/1/2000", periods=10, freq="M"))
    ),
}
df = pd.DataFrame(data, columns=["date", "variable", "category", "value"])
df.sample(5)
         date variable category     value
19 2000-10-31        B    type2 -1.361564
7  2000-08-31        A    type3  0.854984
16 2000-07-31        B    type1  1.108506
13 2000-04-30        B    type1  1.496067
11 2000-02-29        B    type4 -1.280326

If we just run shift() on this, values will be shifted across the boundary between variables A and B, even though the two series overlap in time. So we pivot to a wider format (and then we can shift safely).

df.pivot(index="date", columns="variable", values="value").shift(1)
variable           A         B
date                          
2000-01-31       NaN       NaN
2000-02-29  1.556620 -1.507454
2000-03-31  0.002666 -1.280326
2000-04-30  0.725275 -1.106560
2000-05-31 -1.320918  1.496067
2000-06-30  1.805098  0.286973
2000-07-31  1.397689  0.162781
2000-08-31  0.529012  1.108506
2000-09-30  0.854984  0.398667
2000-10-31  0.787166  1.872951

To go back to the original structure, albeit without the category column, apply .unstack().reset_index().
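Here’s a sketch of that round trip, using a small invented frame in the same shape as the pivoted data above:

```python
import pandas as pd

# A wide frame in the same shape as the pivoted data: dates down the
# index, one column per variable
wide = pd.DataFrame(
    {"A": [0.1, 0.2], "B": [0.3, 0.4]},
    index=pd.to_datetime(["2000-01-31", "2000-02-29"]),
)
wide.index.name = "date"
wide.columns.name = "variable"

# unstack() turns the frame into a long series indexed by (variable, date);
# reset_index() then flattens that back into tidy columns
long_df = wide.unstack().reset_index()
long_df.columns = ["variable", "date", "value"]
print(long_df)
```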

Exercise

Perform a pivot() that applies to both the variable and category columns. (Hint: remember that you will need to pass multiple objects via a list.)