15. Boolean Data#
15.1. Introduction#
In this chapter, we’ll introduce boolean data: data that can be True or False (which can also be encoded as 1s or 0s). We’ll first look at Python’s fundamental True and False boolean values before seeing how they work in data frames.
15.2. Booleans#
Some of the most important operations you will perform are with True and False values, also known as boolean data types. These are fundamental Python values, just as numbers such as 1 are.
15.2.1. Boolean Variables and Conditions#
Assigning the value True or False to a variable works the same way as any other assignment:
bool_variable = True
bool_variable
True
There are two types of operation that are associated with booleans: boolean operations, in which existing booleans are combined, and condition operations, which create a boolean when executed.
Boolean operators that return booleans are as follows:
Operator | Description |
---|---|
x and y | are x and y both True? |
x or y | is at least one of x and y True? |
not x | is x False? |
These behave as you’d expect: True and False evaluates to False, while True or False evaluates to True. There’s also the not keyword. For example,
not True
False
as you would expect.
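To see every combination at once, the following minimal sketch loops over all pairs of boolean values and prints the result of each operator. The name truth_table is illustrative only.

```python
from itertools import product

# Build the full truth table for `and` and `or`
truth_table = []
for x, y in product([True, False], repeat=2):
    truth_table.append((x, y, x and y, x or y))
    print(f"{x!s:>5} {y!s:>5} | and: {(x and y)!s:>5} | or: {(x or y)!s:>5}")
```

Note that the and column is True in only one row, while the or column is False in only one row.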
Conditions are expressions that evaluate as booleans. A simple example is 10 == 20. The == is an operator that compares the objects on either side and returns True if they have the same values, though be careful using it with different data types.
Here’s a table of conditions that return booleans:
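Operator | Description |
---|---|
x == y | is x equal to y? |
x != y | is x not equal to y? |
x < y | is x less than y? |
x > y | is x greater than y? |
x <= y | is x less than or equal to y? |
x >= y | is x greater than or equal to y? |
x is y | is x the same object as y? |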
As you can see from the table, the opposite of == is !=, which you can read as ‘not equal to the value of’. Here’s an example of ==:
boolean_condition = 10 == 20
print(boolean_condition)
False
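The warning about mixing data types is worth a quick illustration. In this small sketch (the variable names are made up for the example), an integer and a float with the same numerical value compare as equal, but a string containing a digit is not equal to the number itself.

```python
int_one = 1
float_one = 1.0
str_one = "1"

# Numbers of different types are compared by value
ints_equal_floats = int_one == float_one
# A string is not equal to a number, even if it looks like one
str_equals_int = str_one == int_one
# Converting explicitly makes the comparison meaningful
converted_equal = int(str_one) == int_one

print(ints_equal_floats, str_equals_int, converted_equal)
```

If data arrive as text, converting them to the right type before comparing is usually the safer route.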
Exercise
What does not (not True)
evaluate to?
The real power of conditions comes when we start to use them in more complex examples. Some of the keywords that evaluate conditions are if, else, and, or, in, not, and is. Here’s an example showing how some of these conditional keywords work:
name = "Ada"
score = 99
if name == "Ada" and score > 90:
    print("Ada, you achieved a high score.")
if name == "Smith" or score > 90:
    print("You could be called Smith or have a high score")
if name != "Smith" and score > 90:
    print("You are not called Smith and you have a high score")
Ada, you achieved a high score.
You could be called Smith or have a high score
You are not called Smith and you have a high score
All three of these conditions evaluate as True, and so all three messages get printed. Given that == and != test for equality and inequality, respectively, you may be wondering what the keywords is and is not are for. Remember that everything in Python is an object, and that two different objects can hold the same values. == and != compare values, while is and is not check whether two names refer to the very same object. For example,
name_list = ["Ada", "Adam"]
name_list_two = ["Ada", "Adam"]
# Compare values
print(name_list == name_list_two)
# Compare objects
print(name_list is name_list_two)
True
False
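As a further sketch of the same idea, assigning an existing list to a second name does not create a new object, so is returns True in that case. Probably the most common everyday use of is is checking against None. The names same_object, copied_list, and result are illustrative.

```python
name_list = ["Ada", "Adam"]
same_object = name_list          # a second name for the same list
copied_list = list(name_list)    # a new list with the same values

print(same_object is name_list)   # True: one object, two names
print(copied_list is name_list)   # False: different objects
print(copied_list == name_list)   # True: same values

# `is` and `is not` are the idiomatic way to check for None
result = None
print(result is None)
print(copied_list is not None)
```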
Note that code with lots of branching if statements is hard to follow, both for you and for anyone else who reads your code. Some automatic code checkers will pick this up and tell you that your code is too complex. Almost all of the time, there’s a way to rewrite your code without lots of branching logic that will be better and clearer than having many nested if statements.
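As one sketch of what rewriting without branching can look like, a chain of if statements that maps a value to a message can often be replaced by a dictionary lookup. The dictionary grade_messages and the function feedback are hypothetical, chosen just for this example.

```python
# Hypothetical mapping from grade to feedback message
grade_messages = {
    "A": "Excellent",
    "B": "Good",
    "C": "Satisfactory",
}


def feedback(grade):
    # .get() returns the fallback when the grade is not a key
    return grade_messages.get(grade, "Unknown grade")


print(feedback("A"))
print(feedback("Z"))
```

Adding a new grade then means adding one entry to the dictionary rather than another branch.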
One of the most useful conditional keywords is in. This one must pop up ten times a day in most coders’ lives because it checks whether a value is contained in a collection, such as a list or a string.
name_list = ["Lovelace", "Smith", "Hopper", "Babbage"]
print("Lovelace" in name_list)
print("Bob" in name_list)
True
False
Exercise
Check if “a” is in the string “Walloping weasels” using in. Is “a” in “Anodyne”?
The opposite is not in.
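Here is a short sketch of not in, reusing the list of names from earlier. It also shows one behaviour that can surprise: on a dictionary, in checks the keys rather than the values. The dictionary scores is made up for the example.

```python
name_list = ["Lovelace", "Smith", "Hopper", "Babbage"]

bob_missing = "Bob" not in name_list
print(bob_missing)

# `in` on a dictionary checks the keys, not the values
scores = {"Lovelace": 99, "Hopper": 95}
print("Hopper" in scores)
print(95 in scores)
```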
Finally, one conditional construct you’re bound to use at some point is the if … elif … else structure:
score = 98
if score == 100:
    print("Top marks!")
elif score > 90 and score < 100:
    print("High score!")
elif score > 10 and score <= 90:
    pass
else:
    print("Better luck next time.")
High score!
Note that this does nothing if the score is between 11 and 90, and prints a message otherwise.
Exercise
Create a new if … elif … else statement that prints “well done” if a score is over 90, “good” if between 40 and 90, and “bad luck” otherwise.
One nice feature of Python is that you can make multiple boolean comparisons in a single line.
a, b = 3, 6
1 < a < b < 20
True
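A chained comparison behaves like the individual comparisons joined with and. The sketch below spells this out using the same values of a and b; the names chained and spelled_out are illustrative.

```python
a, b = 3, 6

chained = 1 < a < b < 20
# The chain is shorthand for joining each pairwise comparison with `and`
spelled_out = (1 < a) and (a < b) and (b < 20)

print(chained, spelled_out)

# A chain is False as soon as any single link fails
broken_chain = 1 < b < a < 20
print(broken_chain)
```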
15.2.2. Conditions in list comprehensions#
List comprehensions are an incredibly useful pattern in Python. Here’s a simple one that produces a list of the first 12 numbers starting from 0:
[x for x in range(12)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
Booleans bring conditionality to the table. We’ll add an if clause followed by a condition that evaluates to either True or False depending on the value of x. So, for example, we can ask for only those numbers that are divisible by 2:
[x for x in range(12) if x % 2 == 0]
[0, 2, 4, 6, 8, 10]
This trick even works with an else clause (but note that we have moved both if and else before the for x in ... part):
[x if x % 2 == 0 else "Not divisible by 2" for x in range(12)]
[0,
'Not divisible by 2',
2,
'Not divisible by 2',
4,
'Not divisible by 2',
6,
'Not divisible by 2',
8,
'Not divisible by 2',
10,
'Not divisible by 2']
15.2.3. Truthy and Falsy Values#
Python objects can be used in expressions that will return a boolean value, such as when a list, listy, is used with if listy. Built-in Python objects that are empty are usually evaluated as False, and are said to be ‘falsy’. In contrast, when these built-in objects are not empty, they evaluate as True and are said to be ‘truthy’.
Let’s see some examples:
listy = []
other_listy = [1, 2, 3]
if not (listy):
    print("Falsy")
else:
    print("Truthy")
Falsy
if not (other_listy):
    print("Falsy")
else:
    print("Truthy")
Truthy
This doesn’t just work on lists; it works for many other truthy and falsy objects:
if not 0:
    print("Falsy")
else:
    print("Truthy")
Falsy
if not [0, 0, 0]:
    print("Falsy")
else:
    print("Truthy")
Truthy
Note that zero is falsy, because it is the ‘nothing’ of numbers, but a list of three zeros is not an empty list, so it evaluates as truthy.
if not None:
    print("Falsy")
else:
    print("Truthy")
Falsy
Knowing what is truthy or falsy is useful in practice; imagine you’d like to default to a specific behaviour if a list called list_vals doesn’t have any values in it. You now know you can check for this simply with if not list_vals.
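A minimal sketch of that default behaviour might look like the following. The function summarise and its messages are hypothetical; the loop at the end uses bool() to show directly how Python treats a few common values.

```python
def summarise(list_vals):
    # An empty list is falsy, so fall back to a default message
    if not list_vals:
        return "No values supplied"
    return f"{len(list_vals)} values, largest is {max(list_vals)}"


print(summarise([]))
print(summarise([3, 1, 2]))

# bool() shows directly how Python treats a value
for value in [0, 1, "", "text", [], [0], {}, None]:
    print(repr(value), bool(value))
```

One thing to keep in mind with this pattern is that None is also falsy, so the check treats a missing list and an empty list in the same way.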
15.2.4. any() and all()#
Of course, there is a big wide world of booleans out there; they don’t always occur on their own. That’s why the built-in functions any() and all() exist. These apply to iterables of booleans, like a list of booleans.
any() takes a list of booleans and returns True if at least one value is true:
any([True, False, False])
True
all() takes a list of booleans and returns True only if all values are true:
all([True, True, True, True])
True
Both of these also work for 1s and 0s:
all([0, 0, 0, 1])
False
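In practice, the booleans passed to any() and all() are often created on the fly from a condition. The sketch below, with a made-up list of scores, does this using a generator expression, and also shows what happens with an empty list.

```python
scores = [72, 85, 91, 64]

# A generator expression creates the booleans on the fly
any_high = any(score > 90 for score in scores)
all_pass = all(score >= 50 for score in scores)
print(any_high, all_pass)

# Edge case: with nothing to check, any() is False and all() is True
print(any([]), all([]))
```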
15.3. Booleans in pandas data frames#
15.3.1. Operations on booleans in data frames#
Quite often, you will run into a scenario where you’re working with data that have True or False values in a data frame. It is easy to create a column of booleans in a pandas data frame:
import pandas as pd
df = pd.DataFrame.from_dict(
{
"bool_col_1": [False] * 3 + [True, True],
"bool_col_2": [True, False, True, False, True],
}
)
df
| | bool_col_1 | bool_col_2 |
---|---|---|
0 | False | True |
1 | False | False |
2 | False | True |
3 | True | False |
4 | True | True |
We can perform operations on these just like regular pandas data frame columns. These accept & (and), | (or), == (equal), and != (not equal) as operations:
df["bool_col_1"] | df["bool_col_2"]
0 True
1 False
2 True
3 True
4 True
dtype: bool
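For completeness, here is a sketch of & on the same two columns, along with ~, which is pandas’ element-wise version of not. Python’s own and, or, and not keywords do not work element-wise on pandas columns, which is why the symbols are used instead. The data frame is rebuilt here so that the example stands alone, and the names both and not_col_1 are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "bool_col_1": [False, False, False, True, True],
        "bool_col_2": [True, False, True, False, True],
    }
)

# True only where both columns are True
both = df["bool_col_1"] & df["bool_col_2"]
# ~ flips every value in a boolean column
not_col_1 = ~df["bool_col_1"]

print(both.tolist())
print(not_col_1.tolist())
```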
Quite often, it’s useful to have a count of the number of true values. If you take the sum of boolean columns in a pandas data frame, it will tot up the number of True values:
df.sum()
bool_col_1 2
bool_col_2 3
dtype: int64
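By the same logic, taking the mean of a boolean column gives the share of values that are True, because True counts as 1 and False as 0. This is a minimal sketch using the same two columns; the name share_true is illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "bool_col_1": [False, False, False, True, True],
        "bool_col_2": [True, False, True, False, True],
    }
)

# True counts as 1 and False as 0, so the mean is the share of True values
share_true = df.mean()
print(share_true)
```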
And if you ever get data formatted as 1s and 0s rather than True and False, it’s easy to convert by changing the data type:
df = pd.DataFrame.from_dict({"bool_col": [0, 1, 0, 1, 1]})
df["bool_col"].astype(bool)
0 False
1 True
2 False
3 True
4 True
Name: bool_col, dtype: bool
15.3.2. Creating booleans from comparisons using columns#
It’s also possible to create boolean columns from numerical (or some other) columns. Let’s use the diamonds dataset to demonstrate this:
diamonds = pd.read_csv(
"https://github.com/mwaskom/seaborn-data/raw/master/diamonds.csv"
)
diamonds.head()
| | carat | cut | color | clarity | depth | table | price | x | y | z |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 |
We’re going to create a new boolean variable for whenever the price is above 1000.
diamonds["expensive"] = diamonds["price"] > 1000
diamonds.sample(10)
| | carat | cut | color | clarity | depth | table | price | x | y | z | expensive |
---|---|---|---|---|---|---|---|---|---|---|---|
9321 | 0.90 | Very Good | D | SI1 | 62.5 | 59.0 | 4579 | 6.11 | 6.15 | 3.83 | True |
20984 | 1.62 | Premium | D | SI2 | 61.2 | 61.0 | 9199 | 7.60 | 7.53 | 4.63 | True |
33146 | 0.31 | Ideal | G | VVS1 | 61.6 | 56.0 | 816 | 4.35 | 4.39 | 2.69 | False |
29834 | 0.38 | Ideal | F | SI1 | 61.3 | 55.0 | 710 | 4.63 | 4.76 | 2.88 | False |
30859 | 0.33 | Ideal | G | VS2 | 62.2 | 56.0 | 743 | 4.47 | 4.44 | 2.77 | False |
24020 | 1.40 | Very Good | E | VS1 | 59.1 | 57.0 | 12196 | 7.28 | 7.38 | 4.33 | True |
4840 | 1.01 | Very Good | F | SI2 | 59.5 | 57.0 | 3709 | 6.61 | 6.66 | 3.95 | True |
26709 | 0.32 | Ideal | H | VVS2 | 61.8 | 55.0 | 645 | 4.38 | 4.42 | 2.72 | False |
35051 | 0.42 | Ideal | I | VVS1 | 62.0 | 55.0 | 884 | 4.80 | 4.88 | 3.00 | False |
46587 | 0.55 | Very Good | E | VS2 | 61.5 | 56.0 | 1786 | 5.25 | 5.28 | 3.24 | True |
Of course, this could also have been achieved in a call to assign:
diamonds.assign(expensive=lambda x: x["price"] > 1000).head()
| | carat | cut | color | clarity | depth | table | price | x | y | z | expensive |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0.23 | Ideal | E | SI2 | 61.5 | 55.0 | 326 | 3.95 | 3.98 | 2.43 | False |
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | False |
2 | 0.23 | Good | E | VS1 | 56.9 | 65.0 | 327 | 4.05 | 4.07 | 2.31 | False |
3 | 0.29 | Premium | I | VS2 | 62.4 | 58.0 | 334 | 4.20 | 4.23 | 2.63 | False |
4 | 0.31 | Good | J | SI2 | 63.3 | 58.0 | 335 | 4.34 | 4.35 | 2.75 | False |
Another use of booleans that is quite useful when it comes to data frames is the .isin() method. For example, if you want True or False values showing whether each column name in a data frame appears in a given list:
diamonds.columns.isin(["x", "y", "z"])
array([False, False, False, False, False, False, False, True, True,
True, False])
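The same method is available on columns, where it returns one boolean per row. The sketch below uses a small made-up data frame, gems, standing in for the diamonds data so that it runs without a download; the values are for illustration only.

```python
import pandas as pd

# Small made-up frame, standing in for the diamonds data
gems = pd.DataFrame(
    {
        "cut": ["Ideal", "Premium", "Good", "Fair", "Ideal"],
        "price": [326, 326, 327, 337, 340],
    }
)

# One boolean per row: is this row's cut in the list?
is_top_cut = gems["cut"].isin(["Ideal", "Premium"])
print(is_top_cut.tolist())

# The boolean column can then pick out the matching rows
print(gems[is_top_cut])
```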
15.3.3. any() and all() in data frames#
A pandas column of booleans behaves a lot like a list of booleans, and we can apply the same logic to it via pandas’ built-in .any() and .all() methods. We expect some entries for "expensive" to be true, so any() should return true:
diamonds["expensive"].any()
True
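For comparison, here is a sketch of .all() alongside .any(), using a short made-up series of prices rather than the full dataset. Since only some of the prices are above 1000, .any() is true while .all() is false.

```python
import pandas as pd

# Made-up prices for illustration
prices = pd.Series([326, 816, 4579, 9199])
expensive = prices > 1000

print(expensive.any())  # at least one price is above 1000
print(expensive.all())  # not every price is above 1000
```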
15.3.4. Logical subsetting#
Although we’ve been effectively using this all along, it’s useful to make it explicit: booleans can be used to logically subset a data frame. Let’s say we only want the bits of a data frame where x is greater than y:
diamonds[diamonds["x"] > diamonds["y"]]
| | carat | cut | color | clarity | depth | table | price | x | y | z | expensive |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0.21 | Premium | E | SI1 | 59.8 | 61.0 | 326 | 3.89 | 3.84 | 2.31 | False |
8 | 0.22 | Fair | E | VS2 | 65.1 | 61.0 | 337 | 3.87 | 3.78 | 2.49 | False |
11 | 0.23 | Ideal | J | VS1 | 62.8 | 56.0 | 340 | 3.93 | 3.90 | 2.46 | False |
12 | 0.22 | Premium | F | SI1 | 60.4 | 61.0 | 342 | 3.88 | 3.84 | 2.33 | False |
14 | 0.20 | Premium | E | SI2 | 60.2 | 62.0 | 345 | 3.79 | 3.75 | 2.27 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
53928 | 0.79 | Premium | E | SI2 | 61.4 | 58.0 | 2756 | 6.03 | 5.96 | 3.68 | True |
53929 | 0.71 | Ideal | G | VS1 | 61.4 | 56.0 | 2756 | 5.76 | 5.73 | 3.53 | True |
53930 | 0.71 | Premium | E | SI1 | 60.5 | 55.0 | 2756 | 5.79 | 5.74 | 3.49 | True |
53931 | 0.71 | Premium | F | SI1 | 59.8 | 62.0 | 2756 | 5.74 | 5.73 | 3.43 | True |
53938 | 0.86 | Premium | H | SI2 | 61.0 | 58.0 | 2757 | 6.15 | 6.12 | 3.74 | True |
23423 rows × 11 columns
The expression diamonds["x"] > diamonds["y"] creates a column of booleans that is used to filter to just the rows where the condition is true.
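Conditions can also be combined before subsetting. The following is a minimal sketch on a small made-up data frame, stones, that borrows a few column names from the diamonds data. It assumes we want rows where x is less than y and the price is below 1000, and it uses .loc to select both rows and columns at once.

```python
import pandas as pd

# Small made-up frame with some of the same column names as the diamonds data
stones = pd.DataFrame(
    {
        "x": [3.95, 3.89, 4.05, 6.11],
        "y": [3.98, 3.84, 4.07, 6.15],
        "price": [326, 326, 327, 4579],
    }
)

# Each comparison needs its own parentheses because & binds more tightly than < or >
mask = (stones["x"] < stones["y"]) & (stones["price"] < 1000)
subset = stones.loc[mask, ["x", "y", "price"]]
print(subset)
```

Storing the condition in a named variable such as mask can make longer filters easier to read and reuse.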