Categorical Data#
Introduction#
In this chapter, we’ll introduce how to work with categorical variables—that is, variables that have a fixed and known set of possible values. This chapter is enormously indebted to the pandas documentation.
Prerequisites#
This chapter will use the pandas data analysis package.
The Category Datatype#
Everything in Python has a type, even the data in pandas data frame columns. While you may be more familiar with numbers and even strings, there is also a special data type for categorical data called Categorical. There are some benefits to using categorical variables (where appropriate):
they can keep track of categories even when no elements of a category are present, which can sometimes be as interesting as when they are (imagine you find no-one from a particular school goes to university)
they can use vastly less of your computer’s memory than encoding the same information in other ways
they can be used efficiently with modelling packages, where they will be recognised as potential ‘dummy variables’, or with plotting packages, which will treat them as discrete values
you can order them (for example, “neutral”, “agree”, “strongly agree”)
All values of categorical data for a pandas column are either in the given categories or take the value np.nan.
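As a rough illustration of the memory point above, the sketch below compares the same column stored as plain Python strings and as a categorical. The data are hypothetical (three made-up labels repeated many times), and the exact byte counts will vary by machine and pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows drawn from just three distinct strings
rng = np.random.default_rng(0)
values = rng.choice(["primary", "secondary", "tertiary"], size=100_000)

as_object = pd.Series(values, dtype="object")
as_category = as_object.astype("category")

# deep=True also counts the memory used by the string objects themselves
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"object: {object_bytes:,} bytes; category: {category_bytes:,} bytes")
```

The saving comes from storing each distinct label once and keeping only small integer codes per row, so it is largest when a few values repeat many times.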
Creating Categorical Data#
Let’s create a categorical column of data:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": ["a", "b", "c", "a"]})
df["A"] = df["A"].astype("category")
df["A"]
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']
Notice that we get some additional information at the bottom of the shown series: we get told that not only is this a categorical column type, but it has three values ‘a’, ‘b’, and ‘c’.
You can also use special functions, such as pd.cut(), to group data into discrete bins. Here’s an example where we specify the labels for the categories directly:
df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
labels = [f"{i} - {i+9}" for i in range(0, 100, 10)]
df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df.head()
| | value | group |
|---|---|---|
| 0 | 41 | 40 - 49 |
| 1 | 7 | 0 - 9 |
| 2 | 47 | 40 - 49 |
| 3 | 64 | 60 - 69 |
| 4 | 70 | 70 - 79 |
In the example above, the group column is of categorical type.
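Because the example above uses random numbers, it can be hard to see exactly where the bin edges fall. Here is a small deterministic sketch, using hypothetical values chosen to sit on the edges, that shows how right=False makes each bin include its left edge and exclude its right edge.

```python
import pandas as pd

# Hypothetical values chosen to sit on or next to the bin edges
ages = pd.Series([0, 9, 10, 19, 20])
labels = ["0 - 9", "10 - 19", "20 - 29"]

# right=False gives the bins [0, 10), [10, 20), [20, 30)
groups = pd.cut(ages, bins=[0, 10, 20, 30], right=False, labels=labels)

print(groups.tolist())     # ['0 - 9', '0 - 9', '10 - 19', '10 - 19', '20 - 29']
print(groups.cat.ordered)  # True: pd.cut returns ordered categories by default
```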
Another way to create a categorical variable is directly using the pd.Categorical() function:
raw_cat = pd.Categorical(
["a", "b", "c", "a", "d", "a", "c"], categories=["b", "c", "d"]
)
raw_cat
[NaN, 'b', 'c', NaN, 'd', NaN, 'c']
Categories (3, object): ['b', 'c', 'd']
We can then enter this into a data frame:
df = pd.DataFrame(raw_cat, columns=["cat_type"])
df["cat_type"]
0 NaN
1 b
2 c
3 NaN
4 d
5 NaN
6 c
Name: cat_type, dtype: category
Categories (3, object): ['b', 'c', 'd']
Note that NaNs appear for any value that isn’t in the categories we specified—you can find more on this in Missing Values.
You can also create ordered categories:
ordered_cat = pd.Categorical(
["a", "b", "c", "a", "d", "a", "c"],
categories=["a", "b", "c", "d"],
ordered=True,
)
ordered_cat
['a', 'b', 'c', 'a', 'd', 'a', 'c']
Categories (4, object): ['a' < 'b' < 'c' < 'd']
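One thing ordering buys you is comparisons against a category. The following is a minimal sketch using hypothetical survey responses; it builds the ordered type with pd.CategoricalDtype, which is an alternative to passing ordered=True to pd.Categorical().

```python
import pandas as pd

# Hypothetical survey scale, listed from lowest to highest
likert = pd.CategoricalDtype(
    categories=["disagree", "neutral", "agree"], ordered=True
)
responses = pd.Series(["agree", "neutral", "disagree", "agree"], dtype=likert)

# Comparisons use the category order, not alphabetical order
above_neutral = responses > "neutral"
print(above_neutral.tolist())  # [True, False, False, True]
```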
Another useful function is qcut, which provides a categorical breakdown according to a given number of quantiles (eg 4 produces quartiles):
pd.qcut(range(1, 10), 4)
[(0.999, 3.0], (0.999, 3.0], (0.999, 3.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], (5.0, 7.0], (7.0, 9.0], (7.0, 9.0]]
Categories (4, interval[float64, right]): [(0.999, 3.0] < (3.0, 5.0] < (5.0, 7.0] < (7.0, 9.0]]
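The interval labels above can be replaced with names of your own. As a small sketch (the label names here are made up for illustration), the labels argument names each quantile bin, and retbins=True additionally returns the bin edges that were computed.

```python
import numpy as np
import pandas as pd

data = np.arange(1, 10)  # the integers 1 to 9

# Three quantile bins (terciles) with hypothetical labels;
# retbins=True returns a (categories, bin_edges) pair
cats, bins = pd.qcut(data, 3, labels=["low", "medium", "high"], retbins=True)

print(list(cats))  # three 'low', then three 'medium', then three 'high'
print(bins)        # four edges, running from 1 to 9
```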
Exercise
Apply the 5-point Likert scale (Strongly disagree, Disagree, Neither Agree nor Disagree, Agree, Strongly agree) to the data generated by np.random.standard_normal(size=100) using pd.qcut with the keyword argument retbins=True.
Working with Categories#
Categorical data has a categories and an ordered property; these list the possible values and whether the ordering matters or not, respectively. These properties are exposed as .cat.categories and .cat.ordered. If you don’t manually specify categories and ordering, they are inferred from the passed arguments.
Let’s see some examples:
df["cat_type"].cat.categories
Index(['b', 'c', 'd'], dtype='object')
df["cat_type"].cat.ordered
False
If categorical data is ordered (ie .cat.ordered == True), then the order of the categories has a meaning and certain operations are possible: you can sort values (with .sort_values), and apply .min and .max.
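To make this concrete, here is a short sketch with hypothetical clothing sizes. It shows sorting and min/max following the category order, and what happens if you ask for the minimum of an unordered categorical.

```python
import pandas as pd

# Hypothetical sizes with an explicit order: small < medium < large
sizes = pd.Series(
    pd.Categorical(
        ["medium", "small", "large", "small"],
        categories=["small", "medium", "large"],
        ordered=True,
    )
)

# Sorting follows the category order rather than alphabetical order
print(sizes.sort_values().tolist())  # ['small', 'small', 'medium', 'large']
print(sizes.min(), sizes.max())      # small large

# Without an order, min() and max() are not defined and raise a TypeError
unordered = sizes.cat.as_unordered()
try:
    unordered.min()
    min_failed = False
except TypeError as err:
    min_failed = True
    print("min() failed on unordered data:", err)
```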
Renaming Categories#
Renaming categories is done using the rename_categories() method (which works with a list or a dictionary).
df["cat_type"] = df["cat_type"].cat.rename_categories(["alpha", "beta", "gamma"])
df
| | cat_type |
|---|---|
| 0 | NaN |
| 1 | alpha |
| 2 | beta |
| 3 | NaN |
| 4 | gamma |
| 5 | NaN |
| 6 | beta |
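The example above passes a list, which must supply a new name for every category. As a sketch of the dictionary form, using a small hypothetical series, only the categories named in the mapping are renamed and the rest are left as they are.

```python
import pandas as pd

letters = pd.Series(["b", "c", "d", "c"], dtype="category")

# Only "b" and "d" appear in the mapping, so "c" keeps its name
renamed = letters.cat.rename_categories({"b": "beta", "d": "delta"})

print(renamed.cat.categories.tolist())  # ['beta', 'c', 'delta']
print(renamed.tolist())                 # ['beta', 'c', 'delta', 'c']
```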
Quite often, you’ll run into a situation where you want to add a category. You can do this with .add_categories():
df["cat_type"] = df["cat_type"].cat.add_categories(["delta"])
df["cat_type"]
0 NaN
1 alpha
2 beta
3 NaN
4 gamma
5 NaN
6 beta
Name: cat_type, dtype: category
Categories (4, object): ['alpha', 'beta', 'gamma', 'delta']
Similarly, there is a .remove_categories() function and a .remove_unused_categories() function. .set_categories() adds and removes categories in one fell swoop. Remember that you need to use df["columnname"].cat before calling any cat(egory) functions though.
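Here is a brief sketch of these three methods on a small hypothetical series. Note that when a category is removed, any values that belonged to it become NaN.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "a"], dtype="category")

# Removing a category turns its values into NaN
removed = s.cat.remove_categories(["c"])
print(removed.tolist())  # ['a', 'b', nan, 'a']

# Add a category with no observations, then drop whatever is unused
padded = s.cat.add_categories(["z"])
trimmed = padded.cat.remove_unused_categories()
print(trimmed.cat.categories.tolist())  # ['a', 'b', 'c']

# set_categories replaces the whole list at once: "c" is dropped, "z" is added
reset = s.cat.set_categories(["a", "b", "z"])
print(reset.cat.categories.tolist())  # ['a', 'b', 'z']
```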
Operations on Categories#
As noted, ordered categories already support some operations. But there are some that work on any set of categories. Perhaps the most useful is value_counts().
df["cat_type"].value_counts()
cat_type
beta 2
alpha 1
gamma 1
delta 0
Name: count, dtype: int64
Note that even though ‘delta’ doesn’t appear at all, it gets a count (of zero). This tracking of categories that have no observations can be quite handy.
mode() is another one:
df["cat_type"].mode()
0 beta
Name: cat_type, dtype: category
Categories (4, object): ['alpha', 'beta', 'gamma', 'delta']
And if your categorical column happens to consist of elements that can undergo operations, those same operations will still work. For example,
# "M" is the month-end alias used here; pandas 2.2 and later prefer "ME" and warn on "M"
time_df = pd.DataFrame(
pd.Series(pd.date_range("2015/05/01", periods=5, freq="M"), dtype="category"),
columns=["datetime"],
)
time_df
| | datetime |
|---|---|
| 0 | 2015-05-31 |
| 1 | 2015-06-30 |
| 2 | 2015-07-31 |
| 3 | 2015-08-31 |
| 4 | 2015-09-30 |
time_df["datetime"].dt.month
0 5
1 6
2 7
3 8
4 9
Name: datetime, dtype: int32
Finally, if you ever need to translate the values in your categorical column into numeric codes, you can use .cat.codes to get a unique integer code for each category.
time_df["datetime"].cat.codes
0 0
1 1
2 2
3 3
4 4
dtype: int8
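Two details of the codes are worth knowing: each code is the position of the value in .cat.categories, and missing values get the code -1. The sketch below, on a small hypothetical series, shows both and then rebuilds the original values with pd.Categorical.from_codes().

```python
import pandas as pd

s = pd.Series(["b", "a", None, "b"], dtype="category")

# Categories are inferred and sorted as ['a', 'b'], so 'a' -> 0 and 'b' -> 1
codes = s.cat.codes
print(codes.tolist())  # [1, 0, -1, 1]

# Codes plus the category list are enough to recover the values
rebuilt = pd.Categorical.from_codes(codes.to_numpy(), categories=s.cat.categories)
print(list(rebuilt))  # ['b', 'a', nan, 'b']
```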