28. Iteration

28.1. Introduction

In Functions, we talked about how important it is to reduce duplication in your code by creating functions instead of copying-and-pasting. Reducing code duplication has three main benefits:

  1. It’s easier to see the intent of your code, because your eyes are drawn to what’s different, not what stays the same.

  2. It’s easier to respond to changes in requirements. As your needs change, you only need to make changes in one place, rather than remembering to change every place that you copied-and-pasted the code.

  3. You’re likely to have fewer bugs because each line of code is used in more places.

One tool for reducing duplication is functions, which reduce duplication by identifying repeated patterns of code and extracting them into independent pieces that can be easily reused and updated. Another tool for reducing duplication is iteration, which helps you when you need to do the same thing to multiple inputs: repeating the same operation on different columns, or on different datasets.

In this chapter you’ll learn about iteration in three ways: explicit iteration, using for loops and while loops; iteration via comprehensions (eg list comprehensions); and iteration for pandas data frames.

28.1.1. Prerequisites

This chapter will use the pandas data analysis package.

28.2. For Loops

A loop is a way of executing a similar piece of code over and over in a similar way.

A for loop runs a block of code once for each item in a sequence, stopping when there are no items left. For example,

name_list = ["Lovelace", "Smith", "Pigou", "Babbage"]

for name in name_list:
    print(name)
Lovelace
Smith
Pigou
Babbage

prints out a name until all names have been printed out.

Every for loop has three components:

  1. The output, here a print statement. But you can imagine a for loop that populates each entry of a data frame or list (but you should always create the full Python object first and populate it later rather than changing its size within the loop because the latter is slow).

  2. The sequence: for name in name_list:. This determines what to loop over: each run of the for loop will assign name to a different value from the iterable name_list. It doesn’t have to be a list; any iterable object will do. It’s useful to think of name above as a pronoun, like “it”.

  3. The body: print(name). This is the code that does the work. It’s run repeatedly, each time with a different value for name. The first iteration will effectively run print(name_list[0]), the second will run print(name_list[1]), and so on.
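
As a minimal sketch of the three components, here is a loop whose output is a list created at its full length up front and then filled in. The names `values` and `squares` are made up for illustration. For plain Python lists, appending inside the loop is also common and usually fine; creating the object first matters most for arrays and data frames.

```python
values = [1.5, 2.0, 3.5]

# Output: create the full-length list first...
squares = [None] * len(values)

# Sequence: loop over the positions 0, 1, 2
for i in range(len(values)):
    # Body: ...then fill in one entry per iteration
    squares[i] = values[i] ** 2

print(squares)
```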

As long as your object is an iterable (ie you can iterate over it), then it can be used in this way in a for loop. The most common examples are lists and tuples, but you can also iterate over strings (in which case each character is selected in turn). One gotcha to be aware of is if you iterate over a string, say “hello”, instead of iterating over a list (or tuple) of strings, eg ["hello"]. In the latter case, you get:

for entry in ["hello"]:
    print(entry)
    print("---end entry---")
hello
---end entry---

While in the former you get something quite different and typically not all that useful:

for entry in "hello":
    print(entry)
    print("---end entry---")
h
---end entry---
e
---end entry---
l
---end entry---
l
---end entry---
o
---end entry---

Exercise

Write a for loop that prints out “Python for Data Science” so that each word is printed in a successive iteration.

A useful trick with for loops is the built-in enumerate() function, which pairs each item in a list with an index that keeps track of its position:

name_list = ["Lovelace", "Smith", "Hopper", "Babbage"]

for i, name in enumerate(name_list):
    print(f"The name in position {i} is {name}")
The name in position 0 is Lovelace
The name in position 1 is Smith
The name in position 2 is Hopper
The name in position 3 is Babbage

Remember, Python indexes from 0 so the first entry of i will be zero. But, if you’d like to index from a different number, you can:

for i, name in enumerate(name_list, start=1):
    print(f"The name in position {i} is {name}")
The name in position 1 is Lovelace
The name in position 2 is Smith
The name in position 3 is Hopper
The name in position 4 is Babbage

Another useful pattern when doing for loops with dictionaries is iteration over key, value pairs. We’ll get to learn more about dictionaries very shortly, but for now what’s important is that they map a key to a value, for example “apple” might map to “fruit”. Let’s take our example from earlier that mapped cities to temperatures. If we wanted to iterate over both keys and values, we can write a for loop like this:

cities_to_temps = {"Paris": 28, "London": 22, "Seville": 36, "Wellesley": 29}

for key, value in cities_to_temps.items():
    print(f"In {key}, the temperature is {value} degrees C today.")
In Paris, the temperature is 28 degrees C today.
In London, the temperature is 22 degrees C today.
In Seville, the temperature is 36 degrees C today.
In Wellesley, the temperature is 29 degrees C today.

Note that we added .items() to the end of the dictionary. And note that we didn’t have to call the key key, or the value value: these are set by their position. But part of best practice in writing code is that there should be no surprises, and writing key, value makes it really clear that you’re using values from a dictionary.
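
To see why .items() matters, the short sketch below (using a shortened version of the dictionary above, with illustrative variable names) shows that looping over a dictionary directly gives only its keys, while .values() gives only its values.

```python
cities_to_temps = {"Paris": 28, "London": 22}

# Looping over the dictionary itself yields the keys only
keys_seen = []
for city in cities_to_temps:
    keys_seen.append(city)

# .values() yields the values only
values_seen = []
for temp in cities_to_temps.values():
    values_seen.append(temp)

print(keys_seen)
print(values_seen)
```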

Exercise

Write a dictionary that maps four cities you know into their respective countries and print the results using the key, value iteration trick.

Another useful type of for loop is provided by the zip() function. You can think of the zip() function as being like a zipper, bringing elements from two different iterables together in turn. Here’s an example:

first_names = ["Ada", "Adam", "Grace", "Charles"]
last_names = ["Lovelace", "Smith", "Hopper", "Babbage"]

for forename, surname in zip(first_names, last_names):
    print(f"{forename} {surname}")
Ada Lovelace
Adam Smith
Grace Hopper
Charles Babbage

The zip function is super useful in practice.
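
One behaviour worth knowing about: if the inputs have different lengths, zip() stops as soon as the shortest one runs out, without raising an error. A small sketch, using a deliberately shortened (and purely illustrative) list of surnames:

```python
first_names = ["Ada", "Adam", "Grace", "Charles"]
short_last_names = ["Lovelace", "Smith"]  # deliberately shorter than first_names

full_names = []
for forename, surname in zip(first_names, short_last_names):
    full_names.append(f"{forename} {surname}")

# Only two pairs are produced; "Grace" and "Charles" are silently dropped
print(full_names)
```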

Exercise

Zip together the first names from above with this jumbled list of surnames: ['Babbage', 'Hopper', 'Smith', 'Lovelace'].

(Hint: you have seen a trick to help re-arrange lists earlier on in the Chapter.)

28.3. List (and Other) Comprehensions

There’s a second way to do loops in Python, called list comprehensions, and in most but not all cases they run faster than the equivalent for loop. More importantly, and this is the reason it’s good practice to use them where possible, they are very readable.

List comprehensions can combine what a for loop and (if needed) a condition do in a single line of code. First, let’s look at a for loop that adds one to each value, written as a list comprehension (NB: in practice, we would use super-fast numpy arrays for this kind of operation):

num_list = range(50, 60)
[1 + num for num in num_list]
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60]

The general pattern is a bit similar to with the for loop but there are some differences. There’s no colon, and no indenting. The syntax is “do something with x” then for x in iterable. Finally, the expression is wrapped in a [ and ] to make the output a list.
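
To make the comparison concrete, here is the same operation written both ways. This is only a sketch; the names `loop_result` and `comp_result` are illustrative.

```python
num_list = range(50, 60)

# For loop version: create an empty list and append to it
loop_result = []
for num in num_list:
    loop_result.append(1 + num)

# List comprehension version: "do something with num" for num in iterable
comp_result = [1 + num for num in num_list]

print(loop_result == comp_result)
```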

Note that lists are not the only wrapping you can provide to this kind of structure. Use ( and ) to make it a generator (don’t worry about what this is for now), or { and } to make it a set (an object that only contains unique values); it’s possible to create a dictionary from a comprehension too! List comprehensions are the most common, so if you only remember one kind, remember them.
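
As a brief sketch of two of these other wrappings, the example below builds a set with { and } and passes a generator expression straight to sum(). The list `words` is made up for illustration.

```python
words = ["apple", "avocado", "banana", "blueberry", "cherry"]

# Set comprehension: duplicates are dropped, so each first letter appears once
first_letters = {word[0] for word in words}

# Generator expression: values are produced one at a time and summed,
# without building an intermediate list
total_length = sum(len(word) for word in words)

print(first_letters)
print(total_length)
```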

Exercise

Create a list comprehension that multiplies numbers in the range from 1 to 10 by 5.

Did you get the range right?

Let’s now see how to include a condition within a list comprehension. Say we had a list of numbers and wanted to filter it according to whether the numbers are divisible by 3 or not, using the modulo operator:

number_list = range(1, 40)
divide_list = [x for x in number_list if x % 3 == 0]
print(divide_list)
[3, 6, 9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39]

The syntax here is do something to x for x in something if x satisfies some condition.

Here’s another example that picks out only the names that include ‘Smith’ in them:

names_list = ["Joe Bloggs", "Adam Smith", "Sandra Noone", "leonara smith"]
smith_list = [x for x in names_list if "smith" in x.lower()]
print(smith_list)
['Adam Smith', 'leonara smith']

Note how we used ‘smith’ rather than ‘Smith’ and then used lower() to ensure we matched names regardless of the case they are written in.

We can even do a whole if-else construct inside a list comprehension:

names_list = ["Joe Bloggs", "Adam Smith", "Sandra Noone", "leonara smith"]
smith_list = [x if "smith" in x.lower() else "Not Smith!" for x in names_list]
print(smith_list)
['Not Smith!', 'Adam Smith', 'Not Smith!', 'leonara smith']

Many of the constructs we’ve seen can be combined. For instance, there is no reason why we can’t have a nested or repeated list comprehension using zip(), and, perhaps more surprisingly, sometimes these are useful!

first_names = ["Ada", "Adam", "Grace", "Charles"]
last_names = ["Lovelace", "Smith", "Hopper", "Babbage"]
names_list = [x + " " + y for x, y in zip(first_names, last_names)]
print(names_list)
['Ada Lovelace', 'Adam Smith', 'Grace Hopper', 'Charles Babbage']

An even more extreme use of list comprehensions can deliver nested structures:

first_names = ["Ada", "Adam"]
last_names = ["Lovelace", "Smith"]
names_list = [[x + " " + y for x in first_names] for y in last_names]
print(names_list)
[['Ada Lovelace', 'Adam Lovelace'], ['Ada Smith', 'Adam Smith']]

This gives a nested structure: the outer comprehension iterates over last_names and, for each surname, the inner comprehension iterates over first_names. (Note that this object is a list of lists of strings!)
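
If the order of iteration is hard to read off the comprehension, it can help to write out the equivalent nested for loops. The sketch below (with illustrative names `nested` and `inner`) builds the same object step by step.

```python
first_names = ["Ada", "Adam"]
last_names = ["Lovelace", "Smith"]

nested = []
for y in last_names:        # outer loop: one inner list per surname
    inner = []
    for x in first_names:   # inner loop: every first name for that surname
        inner.append(x + " " + y)
    nested.append(inner)

print(nested)
```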

Let’s see a dictionary comprehension now. These look a bit similar to set comprehensions because they use { and } at either end but they are different because they come with a colon separating the keys from the values:

{key: value for key, value in zip(first_names, last_names)}
{'Ada': 'Lovelace', 'Adam': 'Smith'}

Exercise

Create a nested list comprehension that results in a list of lists of strings equal to [['a0', 'b0', 'c0'], ['a1', 'b1', 'c1'], ['a2', 'b2', 'c2']] (ie a combination of the first three integers and letters of the alphabet). You may find that you need to convert numbers to strings using str(x) to do this.

If you’d like to learn more about list comprehensions, check out these short video tutorials.

28.4. While Loops

while loops continue to execute code until their conditional expression evaluates to False. (Of course, if it evaluates to True forever, your code will just continue to execute…)

n = 10
while n > 0:
    print(n)
    n -= 1

print("execution complete")
10
9
8
7
6
5
4
3
2
1
execution complete

NB: in case you’re wondering what -= does, it’s a compound assignment that sets the left-hand side equal to the left-hand side minus the right-hand side.

You can use the keyword break to break out of a while loop, for example if it’s reached a certain number of iterations without converging.
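
Here is a minimal sketch of that pattern. As an illustration (not something from earlier in the chapter), it repeatedly improves a guess for the square root of 2 using Newton’s method, and uses break as a safety cap on the number of iterations. The tolerance and the cap are arbitrary choices.

```python
target = 2.0          # we want the square root of this number
guess = 1.0           # starting guess
tolerance = 1e-10     # how close guess squared must be to target
max_iterations = 50   # safety cap, an arbitrary choice
iterations = 0

while abs(guess * guess - target) > tolerance:
    # Newton's update for a square root: average guess and target / guess
    guess = 0.5 * (guess + target / guess)
    iterations += 1
    if iterations >= max_iterations:
        print("stopped without converging")
        break

print(guess, iterations)
```

In this case the loop converges well before the cap, so the break is never reached; it is there to guard against a loop that would otherwise run forever.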

Exercise

Making use of import string and then string.ascii_lowercase to get the characters in the alphabet, write a while loop that iterates backwards through the alphabet (starting at “z”) before printing “execution complete”.

28.5. Iteration with pandas Data Frames

For loops, while loops, and comprehensions all work on pandas data frames, but they are generally a bad way to get things done because they are slow and not memory efficient. To aid cases where iteration is needed, pandas has built-in methods for iteration depending on what you need to do.

These built-in methods for iteration have an overlap with what we’ve seen in Data Transformation but we’ll dig a little deeper into assign()/assignment operations, apply(), and eval() here.

28.5.1. Assignment Operations and assign

An assignment is a statement that assigns the value on the right to the object on the left with an equals sign in the middle.

Let’s imagine we have a data frame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.normal(size=(6, 4)), columns=["a", "b", "c", "d"])
df
a b c d
0 -0.224610 0.502060 -1.135296 -0.453168
1 0.859331 1.144309 -1.585443 0.150175
2 -1.809199 -0.066746 -1.376740 2.182179
3 -0.235759 -1.298939 -1.731243 -1.698847
4 -1.201214 -0.062717 -0.216686 0.781866
5 -1.131781 1.482878 -0.446276 0.056130

pandas has many built-in functions that already iterate over rows and columns for you; for example, to compute the median of each column (working down the rows) or of each row (working across the columns), respectively:

df.median(axis="rows")  # can also use axis=0
a   -0.683770
b    0.219671
c   -1.256018
d    0.103153
dtype: float64

df.median(axis="columns")  # can also use axis=1
0   -0.338889
1    0.504753
2   -0.721743
3   -1.498893
4   -0.139702
5   -0.195073
dtype: float64
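
Because the axis argument is easy to mix up, here is a small sketch on a made-up data frame whose medians can be checked by eye. The axis given is the one that gets collapsed: axis=0 (or "rows") works down the rows and returns one value per column, while axis=1 (or "columns") works across the columns and returns one value per row.

```python
import pandas as pd

small_df = pd.DataFrame({"a": [1.0, 2.0, 6.0], "b": [4.0, 5.0, 9.0]})

# axis=0 collapses the rows: one median per column
per_column = small_df.median(axis=0)

# axis=1 collapses the columns: one median per row
per_row = small_df.median(axis=1)

print(per_column)
print(per_row)
```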

In these cases, and others using built-in functions, the iteration is hidden. What if we want to do something that isn’t built in and also isn’t an aggregation though? Let’s take the example of adding five to every entry. We could do it by explicitly iterating row by row, then repeat that for each column, ie

# Do not do this!


def add_five_slow(df):
    for i in range(len(df)):
        for j in range(len(df.columns)):
            df.iloc[i, j] = df.iloc[i, j] + 5


%timeit add_five_slow(df)
1.76 ms ± 5.81 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

But to do this, every individual cell must be accessed and operated on—so it is very slow, taking milliseconds. pandas has far faster ways of performing the same operation. For simple operations on data frames with consistent type, you can simply add five to the whole data frame:

%timeit df + 5
42.2 μs ± 124 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

This took tens of microseconds, much faster.

This also works on a per column basis, so you can do df["a"] = df["a"] + 5 and so on.

These operations have equivalents using the assign() method, which allows for method chaining, that is, stringing multiple operations together. The assign() version of df["new_a"] = df["a"] + 5 would be

df = df.assign(new_a=lambda x: x["a"] + 5)
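
To illustrate the method chaining mentioned above, here is a sketch on a small made-up data frame. The column name `ratio` is hypothetical. Each assign() call returns a new data frame, so the second call can already use the column created by the first, and the original data frame is left unchanged.

```python
import pandas as pd

small_df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

result = (
    small_df
    .assign(new_a=lambda x: x["a"] + 5)            # first step: add a column
    .assign(ratio=lambda x: x["b"] / x["new_a"])   # second step: use that column
)

print(result)
```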

28.5.2. Apply

What happens if you have a more complicated function you want to iterate over? This is where pandas’ apply() comes in, and it can be used with assignment. apply() can be used across rows or columns. Like assign(), it can be combined with a lambda function and used with either the whole data frame or just a column (in which case there is no need to specify axis=).

df.apply(lambda x: x["a"] - x["new_a"].mean() * x["c"] / x["b"], axis=1)
0   -3.362471
1   -2.269930
2   -3.689900
3   -4.567631
4   -4.846017
5   -3.070733
dtype: float64

Note that this is just an example: you could still do this entire operation without using apply! But you will sometimes find yourself with cases where you do need to use it.
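
For comparison, here is a sketch of the same calculation done both ways on a small made-up data frame. With axis=1, each x is a single row, so x["new_a"] is a single number and taking its mean just returns that number; the column-wise version therefore needs no mean at all.

```python
import numpy as np
import pandas as pd

small_df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 5.0], "c": [4.0, 6.0, 10.0]}
).assign(new_a=lambda x: x["a"] + 5)

# Row-by-row version using apply
row_wise = small_df.apply(
    lambda x: x["a"] - x["new_a"].mean() * x["c"] / x["b"], axis=1
)

# Equivalent whole-column version without apply
vectorised = small_df["a"] - small_df["new_a"] * small_df["c"] / small_df["b"]

print(np.allclose(row_wise, vectorised))
```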

Apply also works with functions, including user-defined functions:

def complicated_function(x):
    return x - x.mean()


df = df.apply(complicated_function, axis=1)
df
a b c d new_a
0 -0.917485 -0.190815 -1.828172 -1.146043 4.082515
1 -0.426210 -0.141231 -2.870984 -1.135365 4.573790
2 -2.233258 -0.490805 -1.800799 1.758120 2.766742
3 -0.195650 -1.258830 -1.691134 -1.658737 4.804350
4 -1.821221 -0.682724 -0.836692 0.161859 3.178779
5 -1.897615 0.717044 -1.212110 -0.709704 3.102385

28.5.3. Eval(uate)

eval() evaluates a string describing operations on DataFrame columns to create new columns. It operates on columns only, not rows or elements. Here’s an example:

df["ratio"] = df.eval("a / new_a")
df
a b c d new_a ratio
0 -0.917485 -0.190815 -1.828172 -1.146043 4.082515 -0.224735
1 -0.426210 -0.141231 -2.870984 -1.135365 4.573790 -0.093185
2 -2.233258 -0.490805 -1.800799 1.758120 2.766742 -0.807180
3 -0.195650 -1.258830 -1.691134 -1.658737 4.804350 -0.040723
4 -1.821221 -0.682724 -0.836692 0.161859 3.178779 -0.572931
5 -1.897615 0.717044 -1.212110 -0.709704 3.102385 -0.611663

Evaluate can also be used to create new boolean columns using, for example, a string "a > 0.5" in the above example.
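
A minimal sketch of that boolean case, using a small made-up data frame and a hypothetical column name `a_is_large`:

```python
import pandas as pd

small_df = pd.DataFrame({"a": [0.2, 0.7, 1.5], "new_a": [5.2, 5.7, 6.5]})

# The string is evaluated column-wise, giving one True/False per row
small_df["a_is_large"] = small_df.eval("a > 0.5")

print(small_df)
```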