(categorical-data)=
# Categorical Data

## Introduction

In this chapter, we'll introduce how to work with categorical variables—that is, variables that have a fixed and known set of possible values. This chapter is enormously indebted to the **pandas** [documentation](https://pandas.pydata.org/).


In [None]:
# remove cell
import matplotlib_inline.backend_inline
import matplotlib.pyplot as plt

# Plot settings
plt.style.use("https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt")
matplotlib_inline.backend_inline.set_matplotlib_formats("svg")

### Prerequisites

This chapter will use the **pandas** data analysis package.

## The Category Datatype

Everything in Python has a type, even the data in **pandas** data frame columns. While you may be more familiar with numbers and even strings, there is also a special data type for categorical data called `Categorical`. There are some benefits to using categorical variables (where appropriate):

- they can keep track of a category even when no elements of that category are present, which can sometimes be as interesting as when they are (imagine you find that no-one from a particular school goes to university)
- they can use vastly less of your computer's memory than encoding the same information in other ways
- they can be used efficiently with modelling packages, where they will be recognised as potential 'dummy variables', or with plotting packages, which will treat them as discrete values
- you can order them (for example, "neutral", "agree", "strongly agree")

All values of categorical data for a **pandas** column are either in the given categories or take the value `np.nan`. 
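
To give a rough sense of the memory benefit, here is a minimal sketch that stores the same information as text and as a categorical. The data (100,000 rows drawn from three made-up school levels) are an assumption for illustration only, and the exact savings will depend on your data and your version of **pandas**.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows drawn from just three distinct strings
rng = np.random.default_rng(0)
as_text = pd.Series(rng.choice(["primary", "secondary", "tertiary"], size=100_000))
as_category = as_text.astype("category")

# deep=True asks pandas to count the memory used by the strings themselves
text_bytes = as_text.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"text: {text_bytes:,} bytes; category: {category_bytes:,} bytes")
```

The saving comes from storing each distinct string once and then recording a small integer for every row.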

## Creating Categorical Data

Let's create a categorical column of data:

In [None]:
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c", "a"]})

df["A"] = df["A"].astype("category")
df["A"]

Notice that we get some additional information at the bottom of the shown series: we get told that not only is this a categorical column type, but it has three values 'a', 'b', and 'c'.

You can also use special functions, such as `pd.cut()`, to group data into discrete bins. Here's an example where we specify the labels for the categories directly:


In [None]:
df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
labels = [f"{i} - {i+9}" for i in range(0, 100, 10)]
df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head()

In the example above, the `group` column is of categorical type.
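
Because the example above uses random numbers, it can be hard to see exactly where the bin edges fall. Here is a small sketch with hand-picked values (chosen purely for illustration) that shows the effect of `right=False`: each bin includes its left edge and excludes its right edge, so 10 lands in the second bin rather than the first.

```python
import pandas as pd

# Hand-picked values that sit on or near the bin edges
values = pd.Series([0, 9, 10, 55, 99])
labels = [f"{i} - {i+9}" for i in range(0, 100, 10)]

# right=False gives bins of the form [0, 10), [10, 20), ..., [90, 100)
groups = pd.cut(values, bins=list(range(0, 101, 10)), right=False, labels=labels)
print(groups.tolist())
# → ['0 - 9', '0 - 9', '10 - 19', '50 - 59', '90 - 99']
```

Note that the categories produced by `pd.cut()` are ordered by default, so these groups can be sorted and compared.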

Another way to create a categorical variable is directly using the `pd.Categorical()` function:

In [None]:
raw_cat = pd.Categorical(
    ["a", "b", "c", "a", "d", "a", "c"], categories=["b", "c", "d"]
)
raw_cat

We can then enter this into a data frame:

In [None]:
df = pd.DataFrame(raw_cat, columns=["cat_type"])
df["cat_type"]

Note that NaNs appear for any value that *isn't* in the categories we specified—you can find more on this in {ref}`missing-values`.

You can also create ordered categories:

In [None]:
ordered_cat = pd.Categorical(
    ["a", "b", "c", "a", "d", "a", "c"],
    categories=["a", "b", "c", "d"],
    ordered=True,
)
ordered_cat

Another useful function is `pd.qcut()`, which provides a categorical breakdown according to a given number of quantiles (e.g. 4 produces quartiles):

In [None]:
pd.qcut(range(1, 10), 4)
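
To see how `pd.qcut()` differs from `pd.cut()`, it can help to try both on skewed data. The sketch below uses a made-up series with one large outlier: `pd.cut()` splits the *range* of the data into equal-width bins, while `pd.qcut()` chooses the bin edges so that each bin holds (roughly) the same number of observations.

```python
import pandas as pd

# Made-up, skewed data: seven small values and one large outlier
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# Two equal-width bins: almost everything falls in the first one
equal_width = pd.cut(values, 2)
print(equal_width.value_counts().sort_index())

# Two quantile-based bins: the observations are split evenly
equal_count = pd.qcut(values, 2)
print(equal_count.value_counts().sort_index())
```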

```{admonition} Exercise
Apply the 5-point Likert scale (Strongly disagree, Disagree, Neither agree nor disagree, Agree, Strongly agree) to the data generated by `np.random.standard_normal(size=100)` using `pd.qcut` with the keyword argument `retbins=True`.
```

## Working with Categories

Categorical data has a `categories` property and an `ordered` property; these list the possible values and say whether the ordering matters, respectively. For a categorical column, these properties are exposed as `.cat.categories` and `.cat.ordered`. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.

Let's see some examples:

In [None]:
df["cat_type"].cat.categories

In [None]:
df["cat_type"].cat.ordered

If categorical data is ordered (i.e. `.cat.ordered == True`), then the order of the categories has a meaning and certain operations are possible: you can sort values (with `.sort_values`), and apply `.min` and `.max`.
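
As a minimal sketch of these operations, here is an ordered categorical built from some made-up survey responses. The last line goes one step beyond the operations listed above: with ordered data you can also compare values against one of the categories.

```python
import pandas as pd

# Made-up survey responses; the order of `categories` defines the ranking
responses = pd.Series(
    pd.Categorical(
        ["agree", "neutral", "strongly agree", "neutral"],
        categories=["neutral", "agree", "strongly agree"],
        ordered=True,
    )
)

# Sorting follows the category order, not alphabetical order
print(responses.sort_values().tolist())
# → ['neutral', 'neutral', 'agree', 'strongly agree']

print(responses.min(), responses.max())
# → neutral strongly agree

# Comparisons against a category also respect the ordering
print((responses > "neutral").tolist())
# → [True, False, True, False]
```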

### Renaming Categories

Renaming categories is done using the `rename_categories()` method (which works with a list or a dictionary).

In [None]:
df["cat_type"] = df["cat_type"].cat.rename_categories(["alpha", "beta", "gamma"])
df
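
The list form renames every category by position. If you only want to rename some of them, the dictionary form can be more convenient, because any category not mentioned is left as it is. Here's a short sketch on a small stand-alone series (made up for illustration):

```python
import pandas as pd

s = pd.Series(["b", "c", "d", "b"], dtype="category")

# Only "b" and "d" are renamed; "c" is left untouched
renamed = s.cat.rename_categories({"b": "beta", "d": "delta"})
print(renamed.cat.categories.tolist())
# → ['beta', 'c', 'delta']
```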

Quite often, you'll run into a situation where you want to add a category. You can do this with `.add_categories()`:

In [None]:
df["cat_type"] = df["cat_type"].cat.add_categories(["delta"])
df["cat_type"]

Similarly, there is a `.remove_categories()` method and a `.remove_unused_categories()` method. `.set_categories()` adds and removes categories in one fell swoop. Remember that you need to do `df["columnname"].cat` before calling any cat(egory) methods though.
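
Here is a brief sketch of these three methods on a small made-up series. The thing to watch out for is that removing a category that is actually in use turns the affected values into NaN.

```python
import pandas as pd

# A made-up series with an unused category, "c"
s = pd.Series(["a", "b", "a"], dtype=pd.CategoricalDtype(["a", "b", "c"]))

# Drop categories that never occur in the data ("c" here)
trimmed = s.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())

# Removing a category that is in use turns its values into NaN
without_b = s.cat.remove_categories(["b"])
print(without_b.tolist())

# set_categories replaces the whole set: "a" stays, "b" and "c" go, "z" is new
reset = s.cat.set_categories(["a", "z"])
print(reset.cat.categories.tolist())
```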

## Operations on Categories

As noted, ordered categories already support some operations, such as sorting and taking the minimum or maximum. But there are some operations that work on any set of categories. Perhaps the most useful is `value_counts()`.

In [None]:
df["cat_type"].value_counts()

Note that even though 'delta' doesn't appear at all, it gets a count (of zero). This tracking of categories that have no observations can be quite handy.

`mode()` is another one:

In [None]:
df["cat_type"].mode()

And if your categorical column happens to consist of *elements* that can undergo operations, those same operations will still work. For example,

In [None]:
# freq="M" means month-end; pandas 2.2 and later deprecate it in favour of "ME"
time_df = pd.DataFrame(
    pd.Series(pd.date_range("2015/05/01", periods=5, freq="M"), dtype="category"),
    columns=["datetime"],
)
time_df

In [None]:
time_df["datetime"].dt.month

Finally, if you ever need to translate the values in your categorical column into integer codes, you can use `.cat.codes`, which gives one code per category.

In [None]:
time_df["datetime"].cat.codes