Boolean Data#

Introduction#

In this chapter, we’ll introduce boolean data: data that can be True or False (which can also be encoded as 1s or 0s). We’ll first look at the fundamental Python boolean variables before seeing how true and false work in data frames. Some of this will be familiar from previous chapters but we’re going to dig a little deeper here.

Booleans#

Some of the most important operations you will perform are with True and False values, also known as boolean data types. These are fundamental Python variables, just as numbers such as 1 are.

Boolean Variables and Conditions#

Assigning the value True or False to a variable works like any other assignment:

bool_variable = True
bool_variable
True

There are two types of operation that are associated with booleans: boolean operations, in which existing booleans are combined, and condition operations, which create a boolean when executed.

Boolean operators that return booleans are as follows:

Operator    Description
x and y     are x and y both True?
x or y      is at least one of x and y True?
not x       is x False?

These behave as you’d expect: True and False evaluates to False, while True or False evaluates to True. There’s also the not keyword, which negates a boolean. For example,

not True
False

as you would expect.
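To check the operators against every combination of inputs, here is a minimal sketch that builds a small truth table. The names `truth_table`, `both`, `either`, and `negated` are chosen only for illustration.

```python
from itertools import product

# Enumerate every pair of booleans and apply each operator to it
truth_table = [
    (x, y, x and y, x or y, not x)
    for x, y in product([True, False], repeat=2)
]

for x, y, both, either, negated in truth_table:
    print(f"x={x!s:<5} y={y!s:<5} and={both!s:<5} or={either!s:<5} not x={negated}")
```

Only the row with two True inputs gives True for and, and only the row with two False inputs gives False for or.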

Conditions are expressions that evaluate as booleans. A simple example is 10 == 20. The == is an operator that compares the objects on either side and returns True if they have the same value, though be careful when using it with different data types.
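As a short illustration of why mixing data types with == needs care, the sketch below compares a few values of different types. The variable names are illustrative only.

```python
# An int and a float with the same numeric value compare as equal
int_vs_float = 1 == 1.0

# A number and a string are not equal, even if they look alike
int_vs_str = 1 == "1"

# bool is a subclass of int, so True compares equal to 1
bool_vs_int = True == 1

print(int_vs_float, int_vs_str, bool_vs_int)  # prints: True False True
```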

Here’s a table of conditions that return booleans:

Operator    Description
x == y      is x equal to y?
x != y      is x not equal to y?
x > y       is x greater than y?
x >= y      is x greater than or equal to y?
x < y       is x less than y?
x <= y      is x less than or equal to y?
x is y      is x the same object as y?

As you can see from the table, the opposite of == is !=, which you can read as ‘not equal to the value of’. Here’s an example of ==:

boolean_condition = 10 == 20
print(boolean_condition)
False

Exercise

What does not (not True) evaluate to?

The real power of conditions comes when we start to use them in more complex examples. Some of the keywords that evaluate conditions are if, else, and, or, in, not, and is. Here’s an example showing how some of these conditional keywords work:

name = "Ada"
score = 99

if name == "Ada" and score > 90:
    print("Ada, you achieved a high score.")

if name == "Smith" or score > 90:
    print("You could be called Smith or have a high score")

if name != "Smith" and score > 90:
    print("You are not called Smith and you have a high score")
Ada, you achieved a high score.
You could be called Smith or have a high score
You are not called Smith and you have a high score

All three of these conditions evaluate as True, and so all three messages get printed. Given that == and != test for equality and inequality, respectively, you may be wondering what the keywords is and is not are for. Remember that everything in Python is an object, and that variables are names that refer to objects. == and != compare values, while is and is not compare identity: whether two names refer to the very same object. For example,

name_list = ["Ada", "Adam"]
name_list_two = ["Ada", "Adam"]

# Compare values
print(name_list == name_list_two)

# Compare objects
print(name_list is name_list_two)
True
False
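The converse case can be sketched too: when two names refer to the same object, is returns True, and a change made through one name is visible through the other. The names `alias` and `copy_of_list` are illustrative.

```python
name_list = ["Ada", "Adam"]
alias = name_list               # a second name for the same list object
copy_of_list = list(name_list)  # a new list object with equal contents

same_object = alias is name_list               # True: one object, two names
equal_copy = copy_of_list == name_list         # True: same values
distinct_copy = copy_of_list is not name_list  # True: different objects

# Mutating through the alias also changes name_list, but not the copy
alias.append("Grace")
print(name_list)     # ['Ada', 'Adam', 'Grace']
print(copy_of_list)  # ['Ada', 'Adam']
```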

Note that code with lots of branching if statements is not very helpful to you or to anyone else who reads your code. Some automatic code checkers will pick this up and tell you that your code is too complex. Almost all of the time, there’s a way to rewrite your code without lots of branching logic that will be better and clearer than having many nested if statements.
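One common way to avoid a long chain of branches, sketched here under the assumption that each branch simply maps a value to a result, is a dictionary lookup. The function names and the grade descriptions are hypothetical and only for illustration.

```python
# Branching version: one if/elif per case
def describe_cut_branching(cut):
    if cut == "Ideal":
        return "top grade"
    elif cut == "Premium":
        return "high grade"
    elif cut == "Good":
        return "mid grade"
    else:
        return "unknown grade"


# Lookup version: the cases live in data rather than in control flow
CUT_DESCRIPTIONS = {
    "Ideal": "top grade",
    "Premium": "high grade",
    "Good": "mid grade",
}


def describe_cut_lookup(cut):
    # .get returns the fallback when the key is missing
    return CUT_DESCRIPTIONS.get(cut, "unknown grade")


print(describe_cut_lookup("Premium"))
print(describe_cut_lookup("Rough"))
```

Both functions return the same results, but adding a new case to the lookup version means adding one dictionary entry rather than another branch.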

One of the most useful conditional keywords is in. This one must pop up ten times a day in most coders’ lives because it can pick out a variable or make sure something is where it’s supposed to be.

name_list = ["Lovelace", "Smith", "Hopper", "Babbage"]

print("Lovelace" in name_list)

print("Bob" in name_list)
True
False

Exercise

Check if “a” is in the string “Walloping weasels” using in. Is “a” in “Anodyne”?

The opposite is not in.
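A minimal sketch of not in, reusing the list of names from above (the result variable names are illustrative):

```python
name_list = ["Lovelace", "Smith", "Hopper", "Babbage"]

bob_missing = "Bob" not in name_list        # "Bob" is absent from the list
hopper_missing = "Hopper" not in name_list  # "Hopper" is present

print(bob_missing, hopper_missing)  # prints: True False
```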

Finally, one conditional construct you’re bound to use at some point is the if-elif-else structure:

score = 98

if score == 100:
    print("Top marks!")
elif score > 90 and score < 100:
    print("High score!")
elif score > 10 and score <= 90:
    pass
else:
    print("Better luck next time.")
High score!

Note that this does nothing if the score is greater than 10 and at most 90 (the pass branch), and prints a message otherwise.

Exercise

Create a new if-elif-else statement that prints “well done” if a score is over 90, “good” if between 40 and 90, and “bad luck” otherwise.

One nice feature of Python is that you can make multiple boolean comparisons in a single line.

a, b = 3, 6

1 < a < b < 20
True
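A chained comparison is equivalent to joining the individual comparisons with and. The sketch below spells this out, with illustrative variable names.

```python
a, b = 3, 6

chained = 1 < a < b < 20
spelled_out = (1 < a) and (a < b) and (b < 20)

# If any link in the chain fails, the whole chain is False
failed_chain = 1 < b < a < 20  # 6 < 3 is False

print(chained, spelled_out, failed_chain)  # prints: True True False
```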

Conditions in list comprehensions#

List comprehensions are an incredibly useful pattern in Python. Here’s a simple one that produces a list of the first 12 numbers starting from 0:

[x for x in range(12)]
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

Booleans bring conditionality to the table. We’ll add an if clause followed by a condition that evaluates to either True or False depending on the value of x. So, for example, we can ask for only those numbers that are divisible by 2:

[x for x in range(12) if x % 2 == 0]
[0, 2, 4, 6, 8, 10]

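A similar pattern works with an else clause, but note that both if and else now come before the for x in ... part. Here the if ... else is a conditional expression that chooses what value to emit for every x, rather than a filter that drops items: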

[x if x % 2 == 0 else "Not divisible by 2" for x in range(12)]
[0,
 'Not divisible by 2',
 2,
 'Not divisible by 2',
 4,
 'Not divisible by 2',
 6,
 'Not divisible by 2',
 8,
 'Not divisible by 2',
 10,
 'Not divisible by 2']
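The two forms can be combined in one comprehension: a conditional expression before the for decides what each item becomes, while an if after the for decides which items are kept. A minimal sketch, with `labels` as an illustrative name:

```python
# Keep only multiples of 3, then label each one as even or odd
labels = ["even" if x % 2 == 0 else "odd" for x in range(12) if x % 3 == 0]

print(labels)  # ['even', 'odd', 'even', 'odd'] for 0, 3, 6, 9
```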

Truthy and Falsy Values#

Python objects can be used in expressions that will return a boolean value, such as when a list, listy, is used with if listy. Built-in Python objects that are empty are usually evaluated as False, and are said to be ‘falsy’. In contrast, when these built-in objects are not empty, they evaluate as True and are said to be ‘truthy’. Let’s see some examples:

listy = []
other_listy = [1, 2, 3]

if not (listy):
    print("Falsy")
else:
    print("Truthy")
Falsy

if not (other_listy):
    print("Falsy")
else:
    print("Truthy")
Truthy

This approach doesn’t just work on lists; it works for many other truthy and falsy objects:

if not 0:
    print("Falsy")
else:
    print("Truthy")
Falsy

if not [0, 0, 0]:
    print("Falsy")
else:
    print("Truthy")
Truthy

Note that zero is falsy (it is the ‘nothing’ of numbers), but a list of three zeros is not an empty list, so it evaluates as truthy. None is falsy too:

if not None:
    print("Falsy")
else:
    print("Truthy")
Falsy

Knowing what is truthy or falsy is useful in practice; imagine you’d like to default to a specific behaviour if a list called list_vals doesn’t have any values in it. You now know you can do it simply with if not list_vals.
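As a sketch of that defaulting pattern, the hypothetical helper below returns a fallback value when list_vals is empty. The function name `mean_or_default` and its `default` parameter are assumptions made for illustration.

```python
def mean_or_default(list_vals, default=0.0):
    """Return the mean of list_vals, or default if the list is empty."""
    # An empty list is falsy, so this branch handles the "no values" case
    if not list_vals:
        return default
    return sum(list_vals) / len(list_vals)


print(mean_or_default([1, 2, 3]))  # 2.0
print(mean_or_default([]))         # 0.0
```

Because None is also falsy, passing None instead of a list would fall back to the default as well, which may or may not be what you want.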

any() and all()#

Of course, booleans don’t always occur on their own; often you have a whole collection of them. That’s why the built-in functions any() and all() exist. These apply to iterables of booleans, such as a list of booleans.

any() takes an iterable of booleans and returns True if at least one value is true:

any([True, False, False])
True

all() takes an iterable of booleans and returns True only if all values are true:

all([True, True, True, True])
True

Both of these also work for 1s and 0s:

all([0, 0, 0, 1])
False
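In practice, any() and all() are often fed conditions computed on the fly rather than a ready-made list of booleans. A minimal sketch, assuming an illustrative list called `scores`:

```python
scores = [72, 95, 88]

# Generator expressions produce one boolean per score
any_high = any(score > 90 for score in scores)   # True: 95 is over 90
all_pass = all(score >= 40 for score in scores)  # True: every score is at least 40
all_high = all(score > 90 for score in scores)   # False: 72 and 88 are not

# Edge case: with nothing to check, any() is False and all() is True
empty_any = any([])
empty_all = all([])

print(any_high, all_pass, all_high, empty_any, empty_all)
```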

Booleans in pandas data frames#

Operations on booleans in data frames#

Quite often, you will run into a scenario where you’re working with data that have True or False values in a data frame. It is easy to create a column of booleans in a pandas data frame:

import pandas as pd

df = pd.DataFrame.from_dict(
    {
        "bool_col_1": [False] * 3 + [True, True],
        "bool_col_2": [True, False, True, False, True],
    }
)
df
bool_col_1 bool_col_2
0 False True
1 False False
2 False True
3 True False
4 True True

We can perform operations on these just like regular pandas data frame columns. These accept & (and), | (or), ~ (not), == (equal), and != (not equal) as element-wise operations; Python’s and, or, and not keywords do not work element-wise on columns:

df["bool_col_1"] | df["bool_col_2"]
0     True
1    False
2     True
3     True
4     True
dtype: bool
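The other element-wise operators follow the same pattern. The sketch below rebuilds the same two columns so that it runs on its own; the result names `both`, `neither`, and `differ` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "bool_col_1": [False, False, False, True, True],
        "bool_col_2": [True, False, True, False, True],
    }
)

# Element-wise and: True only where both columns are True
both = df["bool_col_1"] & df["bool_col_2"]

# Element-wise not, applied to an or: True where neither column is True
neither = ~(df["bool_col_1"] | df["bool_col_2"])

# Element-wise inequality: True where the two columns disagree
differ = df["bool_col_1"] != df["bool_col_2"]

print(pd.DataFrame({"both": both, "neither": neither, "differ": differ}))
```

The parentheses around the or matter: ~ binds more tightly than |, so without them only the first column would be negated.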

Quite often, it’s useful to have a count of the number of true values. If you take the sum of boolean columns in a pandas data frame, it will tot up the number of True values:

df.sum()
bool_col_1    2
bool_col_2    3
dtype: int64
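By the same logic, the mean of a boolean column gives the share of True values, and summing across columns counts the True values in each row. A minimal sketch using the same two columns, rebuilt here so the example is self-contained:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "bool_col_1": [False, False, False, True, True],
        "bool_col_2": [True, False, True, False, True],
    }
)

# True counts as 1 and False as 0, so the mean is the fraction of True values
share_true = df.mean()

# axis=1 sums across the columns, giving a count of True values per row
true_per_row = df.sum(axis=1)

print(share_true)
print(true_per_row)
```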

And if you ever get data formatted as 1s and 0s rather than True and False, it’s easy to convert by changing the data type:

df = pd.DataFrame.from_dict({"bool_col": [0, 1, 0, 1, 1]})
df["bool_col"].astype(bool)
0    False
1     True
2    False
3     True
4     True
Name: bool_col, dtype: bool

Creating booleans from comparisons using columns#

It’s also possible to create boolean columns from numerical (or some other) columns. Let’s use the diamonds dataset to demonstrate this:

diamonds = pd.read_csv(
    "https://github.com/mwaskom/seaborn-data/raw/master/diamonds.csv"
)
diamonds.head()
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75

We’re going to create a new boolean variable for whenever the price is above 1000.

diamonds["expensive"] = diamonds["price"] > 1000
diamonds.sample(10)
carat cut color clarity depth table price x y z expensive
35182 0.31 Ideal G IF 62.0 54.0 891 4.36 4.38 2.71 False
29828 0.42 Very Good F SI2 61.5 59.0 710 4.81 4.85 2.97 False
5776 0.73 Ideal F VVS1 62.7 56.0 3900 5.76 5.72 3.60 True
17029 1.11 Ideal E SI1 61.1 57.0 6800 6.64 6.72 4.08 True
4121 0.93 Good D SI2 63.4 59.0 3540 6.15 6.18 3.91 True
28957 0.31 Premium I VS2 61.1 58.0 435 4.36 4.38 2.67 False
53778 0.72 Very Good G VS2 62.5 59.0 2728 5.69 5.71 3.56 True
1250 0.72 Ideal H VVS1 62.5 57.0 2946 5.70 5.73 3.57 True
34157 0.33 Premium G VS1 61.9 58.0 854 4.46 4.43 2.75 False
1270 1.00 Fair E SI2 65.8 58.0 2948 6.28 6.16 4.09 True

Of course, this could also have been achieved in a call to assign:

diamonds.assign(expensive=lambda x: x["price"] > 1000).head()
carat cut color clarity depth table price x y z expensive
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43 False
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 False
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31 False
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63 False
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75 False

Another use of booleans that is quite useful when it comes to data frames is the .isin method. For example, to get True or False values showing which of a data frame’s column names appear in a given list:

diamonds.columns.isin(["x", "y", "z"])
array([False, False, False, False, False, False, False,  True,  True,
        True, False])
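The same method works on the values of a column, which is handy for keeping rows that match any of several categories. This is a minimal sketch on a small made-up frame (the `cuts` data is illustrative, not the diamonds dataset itself):

```python
import pandas as pd

# A tiny illustrative frame with a categorical-style column
cuts = pd.DataFrame(
    {
        "cut": ["Ideal", "Premium", "Good", "Fair"],
        "price": [326, 326, 327, 337],
    }
)

# True for each row whose cut is one of the listed values
is_top_cut = cuts["cut"].isin(["Ideal", "Premium"])

# Use the boolean column to keep only the matching rows
top_cuts = cuts[is_top_cut]

print(is_top_cut.tolist())  # [True, True, False, False]
print(top_cuts)
```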

any() and all() in data frames#

A pandas column of booleans behaves a lot like a list of booleans, and we can apply the same logic to it via the built-in pandas .any() and .all() methods. We expect some entries for "expensive" to be true, so .any() should return True:

diamonds["expensive"].any()
True
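For comparison, here is a sketch of .all() alongside .any(), using a small made-up frame of prices so that it runs without downloading the diamonds data:

```python
import pandas as pd

# Illustrative prices only; not the full diamonds dataset
prices = pd.DataFrame({"price": [326, 891, 3900, 6800]})
prices["expensive"] = prices["price"] > 1000

any_expensive = prices["expensive"].any()  # at least one price is above 1000
all_expensive = prices["expensive"].all()  # but not every price is
all_positive = (prices["price"] > 0).all()  # a quick sanity check on the data

print(any_expensive, all_expensive, all_positive)  # prints: True False True
```

Checks like the last one are a compact way to validate an assumption about a whole column.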

Logical subsetting#

Although we’ve been effectively using this all along, it’s useful to make it explicit: booleans can be used to logically subset a data frame. Let’s say we only want the rows of a data frame where x is greater than y:

diamonds[diamonds["x"] > diamonds["y"]]
carat cut color clarity depth table price x y z expensive
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31 False
8 0.22 Fair E VS2 65.1 61.0 337 3.87 3.78 2.49 False
11 0.23 Ideal J VS1 62.8 56.0 340 3.93 3.90 2.46 False
12 0.22 Premium F SI1 60.4 61.0 342 3.88 3.84 2.33 False
14 0.20 Premium E SI2 60.2 62.0 345 3.79 3.75 2.27 False
... ... ... ... ... ... ... ... ... ... ... ...
53928 0.79 Premium E SI2 61.4 58.0 2756 6.03 5.96 3.68 True
53929 0.71 Ideal G VS1 61.4 56.0 2756 5.76 5.73 3.53 True
53930 0.71 Premium E SI1 60.5 55.0 2756 5.79 5.74 3.49 True
53931 0.71 Premium F SI1 59.8 62.0 2756 5.74 5.73 3.43 True
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74 True

23423 rows × 11 columns

The expression diamonds["x"] > diamonds["y"] creates a column of booleans that is used to filter to just the rows where the condition is true.
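Several conditions can be combined into one filter with & and |, as long as each condition is wrapped in parentheses. The sketch below uses a small made-up frame (`stones` is illustrative) rather than the full diamonds data:

```python
import pandas as pd

# Illustrative values only
stones = pd.DataFrame(
    {
        "x": [3.95, 3.89, 4.05, 4.20],
        "y": [3.98, 3.84, 4.07, 4.23],
        "price": [326, 1200, 327, 2500],
    }
)

# Each comparison needs its own parentheses because & binds more tightly than < or >
mask = (stones["x"] < stones["y"]) & (stones["price"] > 1000)

# .loc with a boolean column keeps the rows where the mask is True
subset = stones.loc[mask]

# ~ inverts the mask to give every other row
rest = stones.loc[~mask]

print(subset)
print(rest)
```

Only the last row satisfies both conditions here, so subset has one row and rest has the other three.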