Missing Values#
Introduction#
In this chapter, we’ll look at the tools and tricks for dealing with missing values. We’ll start by discussing some general tools for working with missing values recorded as NA
s. We’ll then explore the idea of implicitly missing values, values are that are simply absent from your data, and show some tools you can use to make them explicit.
We’ll finish off with a related discussion of empty groups, caused by categories that don’t appear in the data.
Prerequisites#
This chapter will use the pandas data analysis package.
Explicit Missing Values#
To begin, let’s explore a few handy tools for creating or eliminating missing explicit values, i.e. cells where you see an NA
or nan
.
Types of Missing Values#
One thing to note about missing values in pandas is that they are not all created alike!
For example, real numbers in pandas (such as the float64
dtype) uses a ‘nan’ (aka, Not a Number):
import pandas as pd
import numpy as np
df = pd.DataFrame([5, 27.3, np.nan, -16], columns=["numbers"])
df
numbers | |
---|---|
0 | 5.0 |
1 | 27.3 |
2 | NaN |
3 | -16.0 |
But this isn’t the only way! We can also use Python’s built-in None
value (which, here, gets converted to a NaN because the valid values are all floating point numbers) and pandas’ pd.NA
:
numbers = pd.DataFrame([pd.NA, 27.3, np.nan, -16, None], columns=["numbers"])
numbers
numbers | |
---|---|
0 | <NA> |
1 | 27.3 |
2 | NaN |
3 | -16 |
4 | None |
However, with the object data type (the default for strings), the types can co-exist:
fruits = pd.DataFrame(
["orange", np.nan, "apple", None, "banana", pd.NA], columns=["fruit"]
)
fruits
fruit | |
---|---|
0 | orange |
1 | NaN |
2 | apple |
3 | None |
4 | banana |
5 | <NA> |
Both of these types of missing value can be found using pandas .isna()
function. This returns a new column of boolean values that are True
if the value is any kind of missing value.
fruits.isna()
fruit | |
---|---|
0 | False |
1 | True |
2 | False |
3 | True |
4 | False |
5 | True |
As a convenience, there is also a notna()
function.
Dealing with Explicit Missing Values#
There are various options for dealing with missing values. The fillna()
function achieves this. Let’s take a look at it with some test data:
nan_df = pd.DataFrame(
[
[np.nan, 2, None, 0],
[3, 4, np.nan, 1],
[5, np.nan, np.nan, pd.NA],
[np.nan, 3, np.nan, 4],
],
columns=list("ABCD"),
)
nan_df
A | B | C | D | |
---|---|---|---|---|
0 | NaN | 2.0 | NaN | 0 |
1 | 3.0 | 4.0 | NaN | 1 |
2 | 5.0 | NaN | NaN | <NA> |
3 | NaN | 3.0 | NaN | 4 |
First, we can just fill any missing values with a single fixed value:
nan_df.fillna(0)
A | B | C | D | |
---|---|---|---|---|
0 | 0.0 | 2.0 | 0.0 | 0 |
1 | 3.0 | 4.0 | 0.0 | 1 |
2 | 5.0 | 0.0 | 0.0 | 0 |
3 | 0.0 | 3.0 | 0.0 | 4 |
This can be done on a by-column basis; here we replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
nan_df.fillna(value={"A": 0, "B": 1, "C": 2, "D": 3})
A | B | C | D | |
---|---|---|---|---|
0 | 0.0 | 2.0 | 2.0 | 0 |
1 | 3.0 | 4.0 | 2.0 | 1 |
2 | 5.0 | 1.0 | 2.0 | 3 |
3 | 0.0 | 3.0 | 2.0 | 4 |
We can also propagate non-null values forward or backward (relative to the index)
nan_df.fillna(method="ffill")
/var/folders/x6/ffnr59f116l96_y0q0bjfz7c0000gn/T/ipykernel_24148/1353804149.py:1: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
nan_df.fillna(method="ffill")
A | B | C | D | |
---|---|---|---|---|
0 | NaN | 2.0 | NaN | 0 |
1 | 3.0 | 4.0 | NaN | 1 |
2 | 5.0 | 4.0 | NaN | 1 |
3 | 5.0 | 3.0 | NaN | 4 |
nan_df.fillna(method="bfill")
/var/folders/x6/ffnr59f116l96_y0q0bjfz7c0000gn/T/ipykernel_24148/2505504399.py:1: FutureWarning: DataFrame.fillna with 'method' is deprecated and will raise in a future version. Use obj.ffill() or obj.bfill() instead.
nan_df.fillna(method="bfill")
A | B | C | D | |
---|---|---|---|---|
0 | 3.0 | 2.0 | NaN | 0 |
1 | 3.0 | 4.0 | NaN | 1 |
2 | 5.0 | 3.0 | NaN | 4 |
3 | NaN | 3.0 | NaN | 4 |
The forward fill and backward fill options are particularly useful for time series—but be careful using them if you’re doing a forecasting exercise!
Another feature of all of these functions is that you can limit the number of NaNs that get replaced using the limit=
keyword argument.
nan_df.fillna(value={"A": 0, "B": 1, "C": 2, "D": 3}, limit=1)
A | B | C | D | |
---|---|---|---|---|
0 | 0.0 | 2.0 | 2.0 | 0 |
1 | 3.0 | 4.0 | NaN | 1 |
2 | 5.0 | 1.0 | NaN | 3 |
3 | NaN | 3.0 | NaN | 4 |
Of course, another option might be just to filter out the missing values altogether. There are a couple of ways to do this depending if you want to remove entire rows (axis=0
) or columns (axis=1
)—but in this case, as there is at least one NaN in each column though, there will be no data left!
nan_df["A"].dropna(axis=0) # on a single column
1 3.0
2 5.0
Name: A, dtype: float64
nan_df.dropna(axis=1)
0 |
---|
1 |
2 |
3 |
dropna()
takes some keyword arguments too; for example how="all"
only drops a column or row if all of the values are NA.
nan_df.dropna(how="all")
A | B | C | D | |
---|---|---|---|---|
0 | NaN | 2.0 | NaN | 0 |
1 | 3.0 | 4.0 | NaN | 1 |
2 | 5.0 | NaN | NaN | <NA> |
3 | NaN | 3.0 | NaN | 4 |
There is a thresh
keyword (for threshold)—this allows you to keep only rows or columns containing at most a certain number of missing observations.
Another way to filter out nans is to use the same filtering methods you would use normally, via boolean columns, in combination with the .notna()
function. In the below example, we see all columns for the rows for which A is not NA.
nan_df[nan_df["A"].notna()]
A | B | C | D | |
---|---|---|---|---|
1 | 3.0 | 4.0 | NaN | 1 |
2 | 5.0 | NaN | NaN | <NA> |
Adding NA values#
Sometimes you’ll hit the opposite problem where some concrete value actually represents a missing value. This typically arises in data generated by older software that doesn’t have a proper way to represent missing values, so it must instead use some special value like 99 or -999.
If possible, handle this when reading in the data, for example, by using the na_values=
keyword argument when calling pd.read_csv()
. If you discover the problem later, or your data source doesn’t provide a way to handle it on reading the file, you can use a range of options to replace the given data:
stata_df = pd.DataFrame([[3, 4, 5], [-7, 4, -99], [-99, 6, 5]], columns=list("ABC"))
stata_df
A | B | C | |
---|---|---|---|
0 | 3 | 4 | 5 |
1 | -7 | 4 | -99 |
2 | -99 | 6 | 5 |
The easiest option is probably .replace()
:
stata_df.replace({-99: np.nan})
A | B | C | |
---|---|---|---|
0 | 3.0 | 4 | 5.0 |
1 | -7.0 | 4 | NaN |
2 | NaN | 6 | 5.0 |
Because .replace()
accepts a dictionary, it’s possible to replace several values at once:
stata_df.replace({-99: np.nan, -7: np.nan})
A | B | C | |
---|---|---|---|
0 | 3.0 | 4 | 5.0 |
1 | NaN | 4 | NaN |
2 | NaN | 6 | 5.0 |
Note that this applies to every column in the dataframe. To apply it to just one, just select that specific column.
Implicit Missing Values#
So far we’ve talked about missing values that are explicitly missing, i.e. you can see an NA
or similar in your data.
But missing values can also be implicitly missing, if an entire row of data is simply absent from the data.
Let’s illustrate the difference with a simple data set that records the price of some stock each quarter:
stocks = pd.DataFrame(
{
"year": [2020, 2020, 2020, 2020, 2021, 2021, 2021],
"qtr": [1, 2, 3, 4, 2, 3, 4],
"price": [1.88, 0.59, 0.35, np.nan, 0.92, 0.17, 2.66],
}
)
stocks
year | qtr | price | |
---|---|---|---|
0 | 2020 | 1 | 1.88 |
1 | 2020 | 2 | 0.59 |
2 | 2020 | 3 | 0.35 |
3 | 2020 | 4 | NaN |
4 | 2021 | 2 | 0.92 |
5 | 2021 | 3 | 0.17 |
6 | 2021 | 4 | 2.66 |
This dataset has two missing observations:
The
price
in the fourth quarter of 2020 is explicitly missing, because its value isNA
.The
price
for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.
One way to think about the difference is with this Zen-like koan:
An explicit missing value is the presence of an absence.
An implicit missing value is the absence of a presence.
Sometimes you want to make implicit missings explicit in order to have something physical to work with. In other cases, explicit missings are forced upon you by the structure of the data and you want to get rid of them. The following sections discuss some tools for moving between implicit and explicit missingness.
Pivoting#
You’ve already seen one tool that can make implicit missings explicit and vice versa: pivoting. Making data wider can make implicit missing values explicit because every combination of the rows and new columns must have some value For example, if we pivot stocks
to put quarter
in the columns (and make year
the index), both missing values become explicit:
stocks.pivot(columns="qtr", values="price", index="year")
qtr | 1 | 2 | 3 | 4 |
---|---|---|---|---|
year | ||||
2020 | 1.88 | 0.59 | 0.35 | NaN |
2021 | NaN | 0.92 | 0.17 | 2.66 |
By default, making data longer preserves explicit missing values.
Missing Values in Categorical Variables#
A final type of missingness is the empty group, a group that doesn’t contain any observations, which can arise when working with categorical data.
For example, imagine we have a dataset that contains some health information about people:
health = pd.DataFrame(
{
"name": ["Ikaia", "Oletta", "Leriah", "Dashay", "Tresaun"],
"smoker": ["no", "no", "previously", "no", "yes"],
"age": [34, 88, 75, 47, 56],
}
)
health["smoker"] = health["smoker"].astype("category")
Now we drop the last row of data:
health_cut = health.iloc[:-1, :]
health_cut
name | smoker | age | |
---|---|---|---|
0 | Ikaia | no | 34 |
1 | Oletta | no | 88 |
2 | Leriah | previously | 75 |
3 | Dashay | no | 47 |
The value ‘yes’ for smoker now doesn’t (seem to) appear anywhere in our dataframe. But if we run value_counts()
to get a count of the number of each type of category, you’ll see that the data frame ‘remembers’ that there’s a ‘yes’ category that isn’t currently present:
health_cut["smoker"].value_counts()
smoker
no 3
previously 1
yes 0
Name: count, dtype: int64
You’ll see the same thing happen with a groupby()
operation:
health_cut.groupby("smoker")["age"].mean()
/var/folders/x6/ffnr59f116l96_y0q0bjfz7c0000gn/T/ipykernel_24148/3998383890.py:1: FutureWarning: The default of observed=False is deprecated and will be changed to True in a future version of pandas. Pass observed=False to retain current behavior or observed=True to adopt the future default and silence this warning.
health_cut.groupby("smoker")["age"].mean()
smoker
no 56.333333
previously 75.000000
yes NaN
Name: age, dtype: float64
You can see here that, because we took the mean of a number that doesn’t exist, we got a NaN in place of a real value for the yes row (but there is a ‘yes’ row).