(data-tidy)=
# Tidy Data

## Introduction

In this chapter, you will learn a consistent way to organise your data in Python using the principle known as *tidy data*. Tidy data is not appropriate for everything, but for a lot of analysis and a lot of tabular data it will be what you need. Getting your data into this format requires some work up front, but that work pays off in the long term. Once you have tidy data, you will spend much less time munging data from one representation to another, allowing you to spend more time on the data questions you care about.

In this chapter, you'll first learn the definition of tidy data and see it applied to a simple toy dataset. Then we'll dive into the main tool you'll use for tidying data: melting. Melting allows you to change the form of your data without changing any of the values. We'll finish up with a discussion of usefully untidy data, and how you can create it if needed.

If you particularly enjoy this chapter and want to learn more about the underlying theory, see the [Tidy Data](https://www.jstatsoft.org/article/view/v059i10) paper published in the Journal of Statistical Software.


In [None]:
# remove cell
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline

# Plot settings
plt.style.use("https://github.com/aeturrell/python4DS/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

### Prerequisites

This chapter will use the **pandas** data analysis package.

## Tidy Data

There are three interrelated features that make a dataset tidy:

1.  Each variable is a column; each column is a variable.
2.  Each observation is a row; each row is an observation.
3.  Each value is a cell; each cell is a single value.

The figure below shows this:

![](https://d33wubrfki0l68.cloudfront.net/6f1ddb544fc5c69a2478e444ab8112fb0eea23f8/91adc/images/tidy-1.png)

Why ensure that your data is tidy? There are two main advantages:

1.  There's a general advantage to picking one consistent way of storing data.
    If you have a consistent data structure, it's easier to learn the tools that work with it because they have an underlying uniformity. Some tools, for example the data visualisation package **seaborn**, are designed with tidy data in mind.

2.  There's a specific advantage to placing variables in columns because it allows you to take advantage of **pandas**' vectorised operations (operations that act on a whole column at once and are more efficient than looping over rows).
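
As a rough illustration of the second point, here is a minimal sketch using a small made-up dataset (the country names and numbers are hypothetical, not taken from any real source). Because each variable sits in its own column, a derived variable can be computed with a single column-wise expression:

```python
import pandas as pd

# Hypothetical toy data: one row per country-year, one column per variable.
tidy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "B"],
        "year": [1999, 2000, 1999, 2000],
        "cases": [100, 150, 2000, 2500],
        "population": [10_000, 10_000, 100_000, 125_000],
    }
)

# Each variable is a column, so a new variable is one vectorised
# expression over whole columns; no explicit loop over rows is needed.
tidy["rate_per_10k"] = tidy["cases"] / tidy["population"] * 10_000
print(tidy)
```

If `cases` and `population` were instead scattered across rows or column names, the same calculation would first need a reshaping step.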


Tidy data aren't going to be appropriate *every* time and in every case, but they're a really, really good default for tabular data. Once you use them as your default, it's easier to think about how to perform subsequent operations.

Tidy data are great, but one of **pandas**' advantages relative to other data analysis libraries is that it isn't *too* tied to tidy data and can happily handle awkward non-tidy data manipulation tasks too.

There are two common problems that make ingested data untidy:

1. A variable might be spread across multiple columns.
2. An observation might be scattered across multiple rows.

For the former, we need to "melt" the wide data, with multiple columns, into long data.

For the latter, we need to unstack or pivot the multiple rows into columns (i.e., go from long to wide).

We'll see both below.

## Tools to Make Data Tidy with **pandas**

### Melt

`melt()` can help you go from "wider" data to "longer" data, and is a *really* good one to remember.

![](https://pandas.pydata.org/docs/_images/reshaping_melt.png)

Here's an example of it in action:

In [None]:
import pandas as pd

df = pd.DataFrame(
    {
        "first": ["John", "Mary"],
        "last": ["Doe", "Bo"],
        "job": ["Nurse", "Economist"],
        "height": [5.5, 6.0],
        "weight": [130, 150],
    }
)
print("\n Unmelted: ")
print(df)
print("\n Melted: ")
df.melt(id_vars=["first", "last"], var_name="quantity", value_vars=["height", "weight"])

```{admonition} Exercise
Perform a `melt()` that uses `job` as the id instead of `first` and `last`.
```

How does this relate to tidy data? Sometimes you'll have a variable spread over multiple columns that you want to turn tidy. Let's look at this example that uses cases of [tuberculosis from the World Health Organisation](https://www.who.int/teams/global-tuberculosis-programme/data).

First let's open the data and look at the top of the file.

In [None]:
df_tb = pd.read_parquet(
    "https://github.com/aeturrell/python4DS/raw/refs/heads/main/data/who_tb_cases.parquet"
)
df_tb.head()

You can see that we have two columns for a single variable, year. Let's now melt this.

In [None]:
df_tb.melt(
    id_vars=["country"],
    var_name="year",
    value_vars=["1999", "2000"],
    value_name="cases",
)

We now have one observation per row, and one variable per column: tidy!
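
One detail worth checking after a melt like this: the values in the new `year` column come from column labels, so they keep whatever type those labels had (strings in the call above). The following is a minimal sketch on a made-up stand-in table (the countries and counts are hypothetical) that melts and then converts the year to an integer:

```python
import pandas as pd

# Hypothetical stand-in with the same layout as the table above:
# one id column plus one column per year, with string column labels.
wide = pd.DataFrame({"country": ["X", "Y"], "1999": [10, 30], "2000": [20, 40]})

# With value_vars omitted, melt() uses every column not listed in id_vars.
long = wide.melt(id_vars=["country"], var_name="year", value_name="cases")

# The year values were column labels (strings), so convert them to integers.
long["year"] = long["year"].astype(int)

# Sort so that each country's rows sit together in time order.
long = long.sort_values(["country", "year"]).reset_index(drop=True)
print(long)
```

Converting the type early means later steps such as sorting or filtering by year behave numerically rather than alphabetically.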

### A simpler wide to long

If you don't want the headscratching of `melt()`, there's also `wide_to_long()`, which is really useful for typical data cleaning cases where you have data like this:

In [None]:
import numpy as np

df = pd.DataFrame(
    {
        "A1970": {0: "a", 1: "b", 2: "c"},
        "A1980": {0: "d", 1: "e", 2: "f"},
        "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
        "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
        "X": dict(zip(range(3), np.random.randn(3))),
        "id": dict(zip(range(3), range(3))),
    }
)
df

That is, data where different variables and time periods are spread across the columns. `wide_to_long()` lets us specify the stub names ('A', 'B'), the name of the variable that is encoded in the column suffixes (here, a year), and an id column; any remaining columns (X here) are carried along unchanged.

In [None]:
pd.wide_to_long(df, stubnames=["A", "B"], i="id", j="year")
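
Column names often have a separator between the stub and the suffix, such as `height_2019`. As a small sketch on made-up data (the column names and values here are hypothetical), the `sep` argument of `wide_to_long()` tells **pandas** which separator to strip:

```python
import pandas as pd

# Hypothetical wide data: stub name, underscore, then the year.
wide = pd.DataFrame(
    {
        "id": [0, 1],
        "height_2019": [1.1, 1.2],
        "height_2020": [1.3, 1.4],
        "weight_2019": [30, 32],
        "weight_2020": [31, 35],
    }
)

# sep="_" says the stub and the suffix are joined by an underscore.
long = pd.wide_to_long(
    wide, stubnames=["height", "weight"], i="id", j="year", sep="_"
).reset_index()
print(long)
```

The result has one row per id-year pair, with `height` and `weight` as ordinary columns.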

### Stack and Unstack

`stack()` is a shortcut for taking a single type of wide data variable from columns and turning it into a long form dataset, but with an extra index level.

![](https://pandas.pydata.org/docs/_images/reshaping_stack.png)

`unstack()`, unsurprisingly, does the same operation in reverse.

![](https://pandas.pydata.org/docs/_images/reshaping_unstack.png)

Let's define a multi-index data frame to demonstrate this:

In [None]:
tuples = list(
    zip(
        *[
            ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
            ["one", "two", "one", "two", "one", "two", "one", "two"],
        ]
    )
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df

Let's stack this to create a tidy dataset:

In [None]:
df = df.stack()
df

This has automatically created a multi-layered index, but that can be reverted to a numbered index using `df.reset_index()`.
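
As a minimal sketch of that last step, using a small made-up frame (the labels and the names `variable` and `value` are illustrative choices, not from the example above), you can name the new index level and the values while resetting the index:

```python
import pandas as pd

# Hypothetical wide frame with a named index and two value columns.
wide = pd.DataFrame(
    {"A": [1.0, 2.0], "B": [3.0, 4.0]},
    index=pd.Index(["bar", "baz"], name="first"),
)

# stack() returns a Series indexed by ("first", column label).
stacked = wide.stack()

# Name the new index level, then turn the index levels into ordinary
# columns, calling the column of values "value".
tidy = stacked.rename_axis(["first", "variable"]).reset_index(name="value")
print(tidy)
```

Naming the level before resetting avoids ending up with an automatically generated column name for the stacked labels.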

Now let's see unstack but, instead of unstacking the 'A', 'B' variables we began with, let's unstack the 'first' index level by passing `level=0` (the default is to unstack the innermost index level). This diagram shows what's going on:

![](https://pandas.pydata.org/docs/_images/reshaping_unstack_0.png)

And here's the code:

In [None]:
df.unstack(level=0)

```{admonition} Exercise
What happens if you unstack to `level=1` instead? What about applying `unstack()` twice?
```

### Pivoting data from long to wide

`pivot()` and `pivot_table()` help you to sort out data in which a single observation is scattered over multiple rows.

![](https://pandas.pydata.org/docs/_images/reshaping_pivot.png)


Here's an example dataframe where observations are spread over multiple rows:

In [None]:
df_tb_cp = pd.read_parquet(
    "https://github.com/aeturrell/python4DS/raw/refs/heads/main/data/who_tb_case_and_pop.parquet"
)
df_tb_cp.head()

You see that we have, for each year-country, "case" and "population" in different rows.

Now let's pivot this to see the difference:

In [None]:
pivoted = df_tb_cp.pivot(
    index=["country", "year"], columns=["type"], values="count"
).reset_index()
pivoted
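
`pivot()` needs each combination of index and column values to appear only once. When that isn't the case, `pivot_table()` can combine the duplicates with an aggregation function. Here is a small sketch on made-up data (the values are hypothetical, and taking the mean is just one possible choice of aggregation):

```python
import pandas as pd

# Hypothetical long data in which ("X", "cases") appears twice.
long = pd.DataFrame(
    {
        "country": ["X", "X", "X", "Y", "Y"],
        "type": ["cases", "cases", "population", "cases", "population"],
        "count": [10, 30, 1000, 5, 500],
    }
)

# pivot() would raise a ValueError here because of the duplicate entry;
# pivot_table() aggregates the duplicates using aggfunc instead.
wide = long.pivot_table(
    index="country", columns="type", values="count", aggfunc="mean"
).reset_index()
print(wide)
```

Whether aggregating is appropriate depends on why the duplicates exist, so it is worth inspecting them before choosing `aggfunc`.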

Pivots are especially useful for time series data, where operations like `shift()` or `diff()` are typically applied assuming that an entry in one row follows (in time) from the one above. When we do `shift()` we often want to shift a single variable in time, but if a single observation (in this case a date) is spread over multiple rows, the timing is going to go awry. Let's see an example.

In [None]:
import numpy as np

data = {
    "value": np.random.randn(20),
    "variable": ["A"] * 10 + ["B"] * 10,
    "date": (
        list(pd.date_range("1/1/2000", periods=10, freq="ME"))
        + list(pd.date_range("1/1/2000", periods=10, freq="ME"))
    ),
}
df = pd.DataFrame(data, columns=["date", "variable", "value"])
df.sample(5)

If we just run `shift()` on the above, it will shift the values of variables A and B together, even though they overlap in time and are different variables. So we pivot to a wider format (and then we can shift in time safely).

In [None]:
df.pivot(index="date", columns="variable", values="value").shift(1)
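
To make the problem concrete, here is a small sketch with fixed, made-up values (rather than random ones) so that the effect is easy to read off. It compares a naive shift on the long data with a shift applied after pivoting:

```python
import pandas as pd

# Hypothetical long data: three dates for variable A, then the same dates for B.
dates = pd.date_range("2000-01-01", periods=3, freq="D")
long = pd.DataFrame(
    {
        "date": list(dates) * 2,
        "variable": ["A"] * 3 + ["B"] * 3,
        "value": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)

# Naive shift on the long data: the first row of B receives the
# last value of A, which mixes two different variables.
naive = long["value"].shift(1)
print(naive)

# Shift after pivoting: each variable is lagged within its own column.
lagged = long.pivot(index="date", columns="variable", values="value").shift(1)
print(lagged)
```

In the naive version, the value sitting next to B's first date is 3.0, which belongs to A; after pivoting, each column only ever contains lagged values of its own variable.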

```{admonition} Exercise
Why is the first entry NaN?
```


```{admonition} Exercise
Perform a `pivot()` that applies to both the `variable` and `category` columns in the example from above where category is defined such that `df["category"] = np.random.choice(["type1", "type2", "type3", "type4"], 20)`. (Hint: remember that you will need to pass multiple objects via a list.)
```