27. Functions

27.1. Introduction

One of the best ways to improve your reach as a data scientist is to write functions. Functions allow you to automate common tasks in a more powerful and general way than copy-and-pasting. Writing a function has three big advantages over using copy-and-paste:

  1. You can give a function an evocative name that makes your code easier to understand.

  2. As requirements change, you only need to update code in one place, instead of many.

  3. You eliminate the chance of making incidental mistakes when you copy and paste (e.g. updating a variable name in one place, but not in another).

Writing good functions is a lifetime journey. Even after using Python for many years you can still learn new techniques and better ways of approaching old problems. The goal of this chapter is not to teach you every esoteric detail of functions but to get you started with some pragmatic advice that you can apply immediately.

As well as practical advice for writing functions, this chapter also gives you some suggestions for how to style your code. Good code style is like correct punctuation. Youcanmanagewithoutit, but it sure makes things easier to read! As with styles of punctuation, there are many possible variations. Here we present the style we use in our code, but the most important thing is to be consistent.

27.1.1. Prerequisites

You will need the pandas and numpy packages for this chapter.

27.2. Functions

A function takes inputs, performs a task, and returns any outputs. A function definition begins with the def keyword (short for ‘define a function’), followed by a name and a pair of parentheses, (), which may contain arguments and keyword arguments separated by commas. Then comes a colon. The body of the function is indented relative to the def line. Any outputs are given with the return keyword, again with multiple values separated by commas.

Let’s see a very simple example of a function with a single argument (or arg):

def welcome_message(name):
    return f"Hello {name}, and welcome!"


# Without indentation, this code is not part of the function
name = "Ada"
output_string = welcome_message(name)
print(output_string)
Hello Ada, and welcome!

One powerful feature of functions is that we can define defaults for the input arguments. Arguments with defaults are called keyword arguments (or kwargs). Let’s see that in action by defining a default value for name, along with multiple outputs: a hello message and a score.

def score_message(score, name="student"):
    """This is a doc-string, a string describing a function.
    Args:
        score (float): Raw score
        name (str): Name of student
    Returns:
        str: A hello message.
        float: A normalised score.
    """
    norm_score = (score - 50) / 10
    return f"Hello {name}", norm_score


# Without indentation, this code is not part of the function
name = "Ada"
score = 98
# No name entered
print(score_message(score))
# Name entered
print(score_message(score, name=name))
('Hello student', 4.8)
('Hello Ada', 4.8)
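Because score_message returns two values, you can also assign them to two separate names in a single step. The following is a minimal sketch that repeats the function from above (with a shortened doc-string) so that it runs on its own:

```python
def score_message(score, name="student"):
    """Return a greeting and a normalised score (same logic as above)."""
    norm_score = (score - 50) / 10
    return f"Hello {name}", norm_score


# Unpack the two returned values into two separate variables
message, norm_score = score_message(98, name="Ada")
print(message)
print(norm_score)
```

This is often more convenient than working with the combined output and picking out the pieces later.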

Arguments and keyword arguments

Arguments are the variables that functions always need, such as a and b in def add(a, b): return a + b. The function won’t work without them! Function arguments are sometimes referred to as args.

Keyword arguments are the variables that are optional for functions, such as c in def add(a, b, c=5): return a + b - c. If you do not provide a value for c when calling the function, it falls back to the default, c=5. Keyword arguments are sometimes referred to as kwargs.
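To make the distinction concrete, here is a small sketch using the toy add function from the text. It shows a call that relies on the default, a call that overrides it, and what happens when a required argument is left out (Python raises a TypeError, which we catch here so the example keeps running):

```python
def add(a, b, c=5):
    # Same toy function as in the text: a and b are required, c is optional
    return a + b - c


print(add(1, 2))       # c falls back to its default of 5
print(add(1, 2, c=0))  # c is overridden by keyword
print(add(b=2, a=1))   # required arguments can also be passed by name

try:
    add(1)  # b is missing, so Python raises a TypeError
except TypeError as err:
    print("TypeError:", err)
```

Note that the last successful call passes the required arguments by name too; what makes c optional is that it has a default value, not how it is passed.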

Exercise

What is the return type of a function with multiple return values separated by commas following the return statement?

In that last example, you’ll notice that we added some text to the function. This is a doc-string, or documentation string. It’s there to help users (and, most likely, future you) to understand what the function does. Let’s see how this works in action by calling help() on the score_message function:

help(score_message)
Help on function score_message in module __main__:

score_message(score, name='student')
    This is a doc-string, a string describing a function.
    Args:
        score (float): Raw score
        name (str): Name of student
    Returns:
        str: A hello message.
        float: A normalised score.
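The text that help() displays comes from the function itself: Python stores the doc-string in the function’s __doc__ attribute. A minimal sketch, reusing welcome_message from earlier with a one-line doc-string added purely for illustration:

```python
def welcome_message(name):
    """Return a short welcome message for `name`."""
    return f"Hello {name}, and welcome!"


# help() reads the doc-string from the function's __doc__ attribute
print(welcome_message.__doc__)
```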

Exercise

Write a function that returns a high five unicode character if the input is equal to “coding for economists” and a sad face, “:-/” otherwise.

Add a second argument that takes a default argument of an empty string but, if used, is added (concatenated) to the return message. Use it to create the return output, “:-/ here is my message.”

Write a doc-string for your function and call help on it.

To learn more about args and kwargs, check out these short video tutorials.

27.3. When should you write a function?

You should consider writing a function whenever you’ve copied and pasted a block of code more than twice (i.e. you now have three copies of the same code). For example, take a look at this code. What does it do?

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.normal(size=(10, 4)), columns=["a", "b", "c", "d"])

df["a"] = (df["a"] - df["a"].min()) / (df["a"].max() - df["a"].min())
df["b"] = (df["b"] - df["b"].min()) / (df["b"].max() - df["a"].min())
df["c"] = (df["c"] - df["c"].min()) / (df["c"].max() - df["c"].min())
df["d"] = (df["d"] - df["d"].min()) / (df["d"].max() - df["d"].min())

You might be able to puzzle out that this rescales each column to have a range from 0 to 1. But did you spot the mistake? There was an error when copying-and-pasting the code for df["b"]: someone forgot to change an a to a b. Extracting repeated code out into a function is a good idea because it prevents you from making this type of mistake.

To write a function you need to first analyse the code. How many inputs does it have?

(df["a"] - df["a"].min()) / (df["a"].max() - df["a"].min())

This code only has one input: df["a"]. To make the inputs more clear, it’s a good idea to rewrite the code using temporary variables with general names. Here this code only requires a single numeric vector, so we’ll call it x and put it into a function.

Recall the structure of a function definition: the def keyword, a name, parentheses containing any arguments, a colon, and then an indented body that ends with a return statement if the function has outputs.

So, in Python, functions have the form:

def name_of_function(<inputs>):
    <code to carry out on inputs>
    <return statement, if appropriate>

So here it would be:

def rescale(x):
    return (x - x.min()) / (x.max() - x.min())

There is still some duplication in this code. We’re computing the minimum of the data twice, so it makes sense to do it once:

def rescale(x):
    minimum = x.min()
    return (x - minimum) / (x.max() - minimum)

Pulling out intermediate calculations into named variables is a good practice because it makes it more clear what the code is doing.

There are three key steps to creating a new function:

  1. You need to pick a name for the function. Here we’ve used rescale because this function rescales a vector to lie between 0 and 1.

  2. You list the inputs, or arguments, to the function inside the parentheses that follow its name. Here we have just one argument. If we had more, the definition would look like def name(x, y, z):. (We might also have a named keyword argument such as data= following the arguments.)

  3. You place the code you have developed in the body of the function, an indented block that immediately follows def name(...):.
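The same three steps apply when a function needs more than one input. As a sketch only, here is a hypothetical variant of rescale(), called rescale_to_range, that takes the vector plus two keyword arguments (lower and upper, both names invented for this illustration) giving the target range:

```python
import pandas as pd


def rescale_to_range(x, lower=0.0, upper=1.0):
    """Linearly rescale x so its values run from lower to upper.

    Hypothetical extension of rescale(), for illustration only.
    """
    minimum = x.min()
    unit = (x - minimum) / (x.max() - minimum)  # values between 0 and 1
    return lower + unit * (upper - lower)


result = rescale_to_range(pd.Series([-10, 0, 10]), lower=-1.0, upper=1.0)
print(result)
```

With the defaults left in place, this behaves like rescale(); the keyword arguments only matter when a different range is wanted.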

Note the overall process: we only made the function after we’d figured out how to make it work with a simple input. It’s easier to start with working code and turn it into a function; it’s harder to create a function and then try to make it work.

At this point it’s a good idea to check your function with a few different inputs:

rescale(pd.Series([-10, 0, 10]))
0    0.0
1    0.5
2    1.0
dtype: float64
rescale(pd.Series([1, 2, 3, np.nan, 5]))
0    0.00
1    0.25
2    0.50
3     NaN
4    1.00
dtype: float64

As you write more and more functions you’ll eventually want to convert these informal, interactive tests into formal, automated tests. That process is called unit testing. Unfortunately, it’s beyond the scope of this book.
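To give a flavour of what that looks like, here is a minimal sketch of the two interactive checks above rewritten as test functions. The test names are our own, and we call them by hand; in practice a test runner such as pytest would typically find and run functions like these for you.

```python
import numpy as np
import pandas as pd


def rescale(x):
    minimum = x.min()
    return (x - minimum) / (x.max() - minimum)


def test_rescale_spans_zero_to_one():
    result = rescale(pd.Series([-10, 0, 10]))
    expected = pd.Series([0.0, 0.5, 1.0])
    # Raises an AssertionError if the two series differ
    pd.testing.assert_series_equal(result, expected)


def test_rescale_keeps_missing_values():
    result = rescale(pd.Series([1, 2, 3, np.nan, 5]))
    assert np.isnan(result[3])  # the missing value stays missing
    assert result.min() == 0.0 and result.max() == 1.0


# Call the tests by hand; no output other than the final message means success
test_rescale_spans_zero_to_one()
test_rescale_keeps_missing_values()
print("all tests passed")
```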

We can simplify the original example now that we have a function:

df["a"] = rescale(df["a"])
df["b"] = rescale(df["b"])
df["c"] = rescale(df["c"])
df["d"] = rescale(df["d"])

Compared to the original, this code is easier to understand and we’ve eliminated one class of copy-and-paste errors. There is still quite a bit of duplication since we’re doing the same thing to multiple columns; we can actually remove this too, but we’ll cover how to do that later in the book.
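As a brief preview, and only one of several possible approaches, the remaining duplication can be removed by handing rescale to the data frame’s apply method, which calls the function once per column by default. The sketch below rebuilds the example data so that it runs on its own:

```python
import numpy as np
import pandas as pd


def rescale(x):
    minimum = x.min()
    return (x - minimum) / (x.max() - minimum)


df = pd.DataFrame(np.random.normal(size=(10, 4)), columns=["a", "b", "c", "d"])

# DataFrame.apply calls rescale once per column by default (axis=0)
df = df.apply(rescale)

# Every column should now have a minimum of 0 and a maximum of 1
print(df.agg(["min", "max"]))
```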

Another advantage of functions is that if our requirements change, we only need to make the change in one place. For example, we might discover that some of our variables include infinite values, and rescale() (effectively) fails:

rescale(pd.Series([1, 2, 3, np.inf, 5]))
0    0.0
1    0.0
2    0.0
3    NaN
4    0.0
dtype: float64

Because we’ve extracted the code into a function, we only need to make the fix in one place:

def rescale(x):
    x = x.replace([np.inf, -np.inf], np.nan)  # treat infinities as missing
    minimum = x.min()
    return (x - minimum) / (x.max() - minimum)


rescale(pd.Series([1, 2, 3, np.inf, 5]))
0    0.00
1    0.25
2    0.50
3     NaN
4    1.00
dtype: float64

This is an important part of the “do not repeat yourself” (or DRY) principle. The more repetition you have in your code, the more places you need to remember to update when things change (and they always do!), and the more likely you are to create bugs over time.

27.4. Functions are for humans and computers

It’s important to remember that functions are not just for the computer, but are also for humans. For the most part, Python doesn’t care what your function is called, or what comments it contains, but these are important for human readers. This section discusses some things that you should bear in mind when writing functions that humans can understand.

The name of a function is important. Ideally, the name of your function will be short, but clearly evoke what the function does. That’s hard! But it’s better to be clear than short, as Visual Studio Code’s autocomplete makes it easy to type long names.

Generally, function names should be verbs, and arguments should be nouns. There are some exceptions: nouns are ok if the function computes a very well known noun (e.g. mean is better than compute_mean), or accesses some property of an object (e.g. coef is better than get_coefficients). A good sign that a noun might be a better choice is if you’re using a very broad verb like “get”, “compute”, “calculate”, or “determine”. Use your best judgement and don’t be afraid to rename a function if you figure out a better name later.

# Not a verb, or descriptive
my_awesome_function()
# Long, but clear
impute_missing()
collapse_years()

If your function name is composed of multiple words, use “snake_case”, where each lowercase word is separated by an underscore. If you have a family of functions that do similar things, make sure they have consistent names and arguments. Use a common prefix to indicate that they are connected. That’s better than a common suffix because autocomplete allows you to type the prefix and see all the members of the family.

# Good
input_select()
input_checkbox()
input_text()
# Not so good
select_input()
checkbox_input()
text_input()

A good example of this design is the pandas package: if you don’t remember exactly which function you need to read in data, you can type pd.read_ and jog your memory as the autocomplete brings up the options.
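You can see the same family without an editor by listing the names that pandas exposes with that prefix. A minimal sketch; the exact list depends on the pandas version you have installed:

```python
import pandas as pd

# List the public pandas names that start with "read_"
readers = sorted(name for name in dir(pd) if name.startswith("read_"))
print(readers)
```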

27.5. Function Scope

Scope refers to what parts of your code can see what other parts. There are three different scopes to bear in mind: local, global, and non-local.

Local

If you define a variable inside a function, the rest of your code won’t be able to ‘see’ it or use it. For example, here’s a function that creates a variable, followed by an attempt to use that variable outside the function:

def var_func():
    str_variable = 'Hello World!'

var_func()
print(str_variable)

This would raise an error (a NameError), because as far as your general code is concerned str_variable doesn’t exist outside of the function. This is an example of a local variable, one that only exists within a function.
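If you would like to see the error without stopping your script, you can catch it. This is a minimal sketch; the error_raised flag is our own addition, there only to record what happened:

```python
def var_func():
    str_variable = "Hello World!"  # local: exists only while var_func runs


var_func()

error_raised = False
try:
    print(str_variable)  # str_variable is not defined out here
except NameError as err:
    error_raised = True
    print("NameError:", err)
```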

If you want to create variables inside a function and have them persist, you need to explicitly pass them out with a return statement, for example return str_variable, like this:

def var_func():
    str_variable = "Hello World!"
    return str_variable


returned_var = var_func()
print(returned_var)
Hello World!

Global

A variable declared outside of a function is known as a global variable because it is accessible everywhere:

y = "I'm a global variable"


def print_y():
    print("y is inside a function:", y)


print_y()
print("y is outside a function:", y)
y is inside a function: I'm a global variable
y is outside a function: I'm a global variable
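Reading a global variable inside a function works as shown, but assigning to the same name inside a function behaves differently: it creates a new local variable and leaves the global one alone. A minimal sketch, with a hypothetical function shadow_y:

```python
y = "I'm a global variable"


def shadow_y():
    # Assigning to y inside the function creates a new local variable
    # named y; the global y is left untouched.
    y = "I'm a local variable"
    return y


print(shadow_y())
print(y)
```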