# Categorical Data

## Introduction

In this chapter, we’ll introduce how to work with categorical variables—that is, variables that have a fixed and known set of possible values. This chapter is enormously indebted to the **pandas** documentation.

### Prerequisites

This chapter will use the **pandas** data analysis package.

## The Category Datatype

Everything in Python has a type, even the data in **pandas** data frame columns. While you may be more familiar with numbers and even strings, there is also a special data type for categorical data called `Categorical`. There are some benefits to using categorical variables (where appropriate):

- they keep track of categories even when no elements of a category are present, which can sometimes be as interesting as when they are (imagine you find no-one from a particular school goes to university)
- they can use vastly less of your computer’s memory than encoding the same information in other ways, particularly when a column has a few distinct values repeated many times
- they can be used efficiently with modelling packages, where they will be recognised as potential ‘dummy variables’, or with plotting packages, which will treat them as discrete values
- you can order them (for example, “neutral”, “agree”, “strongly agree”)

All values of categorical data for a **pandas** column are either in the given categories or take the value `np.nan`.
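To make the memory point concrete, here is a minimal sketch that stores the same repeated labels once as plain strings and once as a categorical column. The labels and the series length are made up for illustration, and the exact byte counts will depend on your **pandas** version.

```python
import pandas as pd

# Hypothetical data: three distinct labels repeated many times
s_obj = pd.Series(["low", "medium", "high"] * 10_000)
s_cat = s_obj.astype("category")

# deep=True counts the memory used by the string values themselves
obj_bytes = s_obj.memory_usage(deep=True)
cat_bytes = s_cat.memory_usage(deep=True)

print(f"strings: {obj_bytes} bytes, category: {cat_bytes} bytes")
```

The saving comes from storing each distinct label once and representing each row by a small integer code.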

## Creating Categorical Data

Let’s create a categorical column of data:

```
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["A"] = df["A"].astype("category")
df["A"]
```

```
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']
```

Notice that we get some additional information at the bottom of the shown series: we get told that not only is this a categorical column type, but it has three values ‘a’, ‘b’, and ‘c’.

You can also use special functions, such as `pd.cut()`, to group data into discrete bins. Here’s an example where we specify the labels for the categories directly:

```
df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
labels = [f"{i} - {i+9}" for i in range(0, 100, 10)]
df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head()
```

| | value | group |
|---|---|---|
| 0 | 39 | 30 - 39 |
| 1 | 48 | 40 - 49 |
| 2 | 71 | 70 - 79 |
| 3 | 48 | 40 - 49 |
| 4 | 20 | 20 - 29 |

In the example above, the `group` column is of categorical type.
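Because the example above uses random numbers, it can be hard to see exactly where the bin edges fall. The sketch below reuses the same bin edges and labels on a few hand-picked values (the values themselves are chosen purely for illustration). With `right=False`, each bin includes its left edge and excludes its right edge.

```python
import pandas as pd

edges = range(0, 105, 10)
labels = [f"{i} - {i+9}" for i in range(0, 100, 10)]

# Hand-picked values, including one that sits exactly on a bin edge (10)
example_values = [5, 10, 39, 99]
groups = pd.cut(example_values, edges, right=False, labels=labels)

print(list(groups))
```

Here the value 10 lands in the "10 - 19" bin, not the "0 - 9" bin, because the bins are closed on the left.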

Another way to create a categorical variable is to use the `pd.Categorical` constructor directly:

```
raw_cat = pd.Categorical(
    ["a", "b", "c", "a", "d", "a", "c"], categories=["b", "c", "d"]
)
raw_cat
```

```
[NaN, 'b', 'c', NaN, 'd', NaN, 'c']
Categories (3, object): ['b', 'c', 'd']
```

We can then enter this into a data frame:

```
df = pd.DataFrame(raw_cat, columns=["cat_type"])
df["cat_type"]
```

```
0 NaN
1 b
2 c
3 NaN
4 d
5 NaN
6 c
Name: cat_type, dtype: category
Categories (3, object): ['b', 'c', 'd']
```

Note that NaNs appear for any value that *isn’t* in the categories we specified—you can find more on this in Missing Values.

You can also create ordered categories:

```
ordered_cat = pd.Categorical(
    ["a", "b", "c", "a", "d", "a", "c"],
    categories=["a", "b", "c", "d"],
    ordered=True,
)
ordered_cat
```

```
['a', 'b', 'c', 'a', 'd', 'a', 'c']
Categories (4, object): ['a' < 'b' < 'c' < 'd']
```

Another useful function is `pd.qcut()`, which provides a categorical breakdown according to a given number of quantiles (e.g. 4 produces quartiles):

```
pd.qcut(range(1, 10), 4)
```

```
[(0.999, 3.0], (0.999, 3.0], (0.999, 3.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (5.0, 7.0], (7.0, 9.0], (7.0, 9.0]]
Categories (4, interval[float64, right]): [(0.999, 3.0] < (3.0, 5.0] < (5.0, 7.0] < (7.0, 9.0]]
```
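The interval labels that `pd.qcut` produces by default can be replaced with names of your own through the `labels` keyword argument. The following is a small sketch on the same input as above; the names "Q1" to "Q4" are just illustrative choices.

```python
import pandas as pd

values = list(range(1, 10))

# One label per quantile bin, given in ascending order
quartiles = pd.qcut(values, 4, labels=["Q1", "Q2", "Q3", "Q4"])

print(list(quartiles))
print(quartiles.ordered)
```

The result is an ordered categorical, so the labels keep the low-to-high ordering of the underlying bins.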

Exercise

Apply the 5-point Likert scale (Strongly disagree, Disagree, Neither agree nor disagree, Agree, Strongly agree) to the data generated by `np.random.standard_normal(size=100)` using `pd.qcut` with the keyword argument `retbins=True`.

## Working with Categories

Categorical data has a `categories` and an `ordered` property; these list the possible values and whether the ordering matters or not, respectively. For a data frame column, these properties are exposed as `.cat.categories` and `.cat.ordered`. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.

Let’s see some examples:

```
df["cat_type"].cat.categories
```

```
Index(['b', 'c', 'd'], dtype='object')
```

```
df["cat_type"].cat.ordered
```

```
False
```

If categorical data is ordered (i.e. `.cat.ordered == True`), then the order of the categories has a meaning and certain operations are possible: you can sort values (with `.sort_values`), and apply `.min` and `.max`.
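As a brief illustration of these operations, the sketch below builds a small ordered categorical from the “neutral”, “agree”, “strongly agree” scale mentioned earlier. The responses themselves are invented for the example.

```python
import pandas as pd

responses = pd.Series(
    pd.Categorical(
        ["agree", "neutral", "strongly agree", "neutral"],
        categories=["neutral", "agree", "strongly agree"],
        ordered=True,
    )
)

# Sorting follows the category order, not alphabetical order
sorted_responses = responses.sort_values()

# min and max are defined because the categories are ordered
lowest = responses.min()
highest = responses.max()

# Comparisons against a category also respect the ordering
at_least_agree = responses >= "agree"

print(sorted_responses.tolist(), lowest, highest)
```

Note that sorting places "strongly agree" last even though it would come before nothing alphabetically; the order comes from the `categories` argument.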

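### Renaming Categories

Renaming categories is done with the `rename_categories()` method (which works with a list or a dictionary). Older versions of **pandas** also allowed assigning new values to the `.cat.categories` property directly, but this was removed in **pandas** 2.0.

```
# Rename the existing categories in order: b -> alpha, c -> beta, d -> gamma
df["cat_type"] = df["cat_type"].cat.rename_categories(["alpha", "beta", "gamma"])
df
```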

| | cat_type |
|---|---|
| 0 | NaN |
| 1 | alpha |
| 2 | beta |
| 3 | NaN |
| 4 | gamma |
| 5 | NaN |
| 6 | beta |
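The dictionary form of `rename_categories()` maps old names to new ones explicitly, which avoids relying on the order of the categories. Here is a minimal sketch on a small stand-alone series (the series `s` is created just for this example):

```python
import pandas as pd

s = pd.Series(["b", "c", "d", "c"], dtype="category")

# Keys are the existing category names, values are the new names
renamed = s.cat.rename_categories({"b": "alpha", "c": "beta", "d": "gamma"})

print(renamed.tolist())
```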

Quite often, you’ll run into a situation where you want to add a category. You can do this with `.add_categories`:

```
df["cat_type"] = df["cat_type"].cat.add_categories(["delta"])
df["cat_type"]
```

```
0 NaN
1 alpha
2 beta
3 NaN
4 gamma
5 NaN
6 beta
Name: cat_type, dtype: category
Categories (4, object): ['alpha', 'beta', 'gamma', 'delta']
```

Similarly, there is a `.remove_categories` method and a `.remove_unused_categories()` method. `.set_categories` adds and removes categories in one fell swoop. Remember that you need to go through the `.cat` accessor, as in `df["columnname"].cat`, before calling any of these category methods.
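A short sketch of these three methods on a stand-alone series may help; the series `s` and the extra category name "epsilon" are made up for illustration. Values whose category is removed become `NaN`.

```python
import pandas as pd

s = pd.Series(["alpha", "beta", "gamma", "beta"], dtype="category")
s = s.cat.add_categories(["delta"])  # "delta" has no observations

# Removing a category that is in use turns its values into NaN
without_gamma = s.cat.remove_categories(["gamma"])

# Drops "delta", the only category with no observations
trimmed = s.cat.remove_unused_categories()

# Replaces the whole set: "alpha" and "delta" go, "epsilon" is added
reset = s.cat.set_categories(["beta", "gamma", "epsilon"])

print(list(trimmed.cat.categories))
print(reset.tolist())
```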

## Operations on Categories

As noted, ordered categories already support some extra operations. But there are others that work on any set of categories. Perhaps the most useful is `value_counts`:
```
df["cat_type"].value_counts()
```

```
beta 2
alpha 1
gamma 1
delta 0
Name: cat_type, dtype: int64
```

Note that even though ‘delta’ doesn’t appear at all, it gets a count (of zero). This tracking of categories with no observations can be quite handy.

`mode` is another one:

```
df["cat_type"].mode()
```

```
0 beta
Name: cat_type, dtype: category
Categories (4, object): ['alpha', 'beta', 'gamma', 'delta']
```

And if your categorical column happens to consist of *elements* that can undergo operations, those same operations will still work. For example,

```
time_df = pd.DataFrame(
    pd.Series(pd.date_range("2015/05/01", periods=5, freq="M"), dtype="category"),
    columns=["datetime"],
)
time_df
```

| | datetime |
|---|---|
| 0 | 2015-05-31 |
| 1 | 2015-06-30 |
| 2 | 2015-07-31 |
| 3 | 2015-08-31 |
| 4 | 2015-09-30 |

```
time_df["datetime"].dt.month
```

```
0 5
1 6
2 7
3 8
4 9
Name: datetime, dtype: int64
```

Finally, if you ever need to translate the values in your categorical column into integer codes, you can use `.cat.codes`. Each code is the position of the value in the list of categories, so every category gets a unique code.

```
time_df["datetime"].cat.codes
```

```
0 0
1 1
2 2
3 3
4 4
dtype: int8
```
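One detail worth knowing is how missing values are coded. The sketch below rebuilds the earlier example in which "a" is not among the specified categories; the variable names are chosen just for this illustration.

```python
import pandas as pd

cat = pd.Series(
    pd.Categorical(["a", "b", "c", "a", "d"], categories=["b", "c", "d"])
)

# Values outside the categories are NaN, and NaN is coded as -1
codes = cat.cat.codes

print(codes.tolist())
```

The code -1 is reserved for missing values, while the remaining codes index into `cat.cat.categories`.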