5. Workflow: Style#
Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread. Even as a very new programmer it’s a good idea to work on your code style. Using a consistent style makes it easier for others (including future-you!) to read your work, and is particularly important if you need to get help from someone else.
This chapter will introduce you to some important style points drawn from Clean Code in Python, Tips for Better Coding from Coding for Economists, the UK Government Statistical Service’s Quality Assurance of Code for Analysis and Research guidance, and the bible of Python style guides, PEP 8 — Style Guide for Python Code.
Styling your code will feel a bit tedious to start with, but if you practice it, it will soon become second nature. Additionally, there are some great tools to quickly restyle existing code, like the Black Python package (“you can have any colour you like, as long as it’s black”).
Once you’ve installed Black by running pip install black, you can use it on the command line (aka the terminal) within Visual Studio Code. Open up a terminal by clicking ‘Terminal -> New Terminal’ and then run black *.py to apply a standard code style to all Python scripts in the current directory.
5.1. Names#
First, names matter. Use meaningful names for variables, functions, or whatever it is you’re naming. Avoid abbreviations that you understand now but which will be unclear to others, or future you. For example, use real_wage_hourly over re_wg_ph. I know it’s tempting to use temp but you’ll feel silly later when you can’t for the life of you remember what temp does or is. A good trick when naming booleans (variables that are either true or false) is to start the name with is_ followed by what the boolean variable refers to, for example is_married.
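As a small illustration of these naming tips, the sketch below computes the same quantity twice: once with cryptic names and once with descriptive ones. All of the names and numbers are made up for the example.

```python
# Hard to read: what do these abbreviations mean, and what is temp?
nw_ph = 15.0
p_ix = 1.25
temp = nw_ph / p_ix

# Easier to read: every name says what it holds
nominal_wage_hourly = 15.0
price_index = 1.25
real_wage_hourly = nominal_wage_hourly / price_index

# Booleans read naturally with an is_ prefix
is_above_threshold = real_wage_hourly > 10.0

print(real_wage_hourly, is_above_threshold)
```

Both versions do the same arithmetic; only the second can be understood without asking its author.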
As well as this general tip, Python has conventions on naming different kinds of variables. The naming convention for almost all objects is lower case separated by underscores, e.g. a_variable = 10 or ‘this_is_a_script.py’. This style of naming is also known as snake case. There are different naming conventions though; Allison Horst made this fantastic cartoon of the different conventions that are in use.
Different naming conventions. Artwork by @allison_horst.
There are three exceptions to the snake case convention: classes, which should be in camel case, eg ThisIsAClass; constants, which are in capital snake case, eg THIS_IS_A_CONSTANT; and packages, which are typically lowercase without spaces or underscores, eg thisisapackage.
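To see these conventions side by side, here is a minimal sketch. The class, constant, and variable names are invented purely for illustration.

```python
# Constant: capital snake case
SECONDS_PER_HOUR = 3600


# Class: camel case
class WorkShift:
    def __init__(self, duration_seconds):
        # Variables and attributes: snake case
        self.duration_seconds = duration_seconds

    # Functions and methods: snake case
    def duration_hours(self):
        return self.duration_seconds / SECONDS_PER_HOUR


night_shift = WorkShift(duration_seconds=28_800)
print(night_shift.duration_hours())
```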
For some quick shortcuts to re-naming columns in pandas data frames or other string variables, try the unicode-friendly slugify library or the clean_headers() function from the dataprep library.
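If you would rather not install anything, a rough version of this kind of clean-up can be sketched with the standard library. The to_snake_case function below is a hypothetical helper written for this example, not part of slugify or dataprep, and unlike those libraries it only handles plain ASCII letters and digits.

```python
import re


def to_snake_case(name):
    """Convert a messy label such as 'Real Wage (hourly)' to snake case."""
    # Replace every run of non-alphanumeric characters with one underscore
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name)
    # Trim stray underscores from the ends and lower-case the result
    return cleaned.strip("_").lower()


messy_columns = ["Real Wage (hourly)", "  Worker Type ", "days-since-treatment"]
clean_columns = [to_snake_case(col) for col in messy_columns]
print(clean_columns)
```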
The better named your variables, the clearer your code will be, and the fewer comments you will need to write!
In summary:
- use descriptive variable names that reveal your intention, eg days_since_treatment
- avoid using ambiguous abbreviations in names, eg use real_wage_hourly over rw_ph
- always use the same vocabulary, eg don’t switch from worker_type to employee_type
- avoid ‘magic numbers’, ie numbers in your code that set a key parameter. Set these as named constants instead. Here’s an example:

    import random

    # This is bad
    def roll():
        return random.randint(0, 36)  # magic number!

    # This is good
    MAX_INT_VALUE = 36

    def roll():
        return random.randint(0, MAX_INT_VALUE)

- use verbs for function names, eg get_regression()
- use consistent verbs for function names, don’t use get_score() and grab_results() (instead use get for both)
- variable names should be snake_case and all lowercase, eg first_name
- class names should be CamelCase, eg MyClass
- function names should be snake_case and all lowercase, eg quick_sort()
- constants should be snake_case and all uppercase, eg PI = 3.14159
- modules should have short, snake_case names and all lowercase, eg pandas
- single quotes and double quotes are equivalent so pick one and be consistent; most automatic formatters prefer "
5.2. Whitespace#
Surrounding bits of code with whitespace can significantly enhance readability. One such convention is that functions should have two blank lines following their last line. Another is that the = in an assignment has a single space on either side:
this_is_a_var = 5
Another convention is that a space appears after a comma (,), for example in the definition of a list we would have
list_var = [1, 2, 3, 4]
rather than
list_var = [1,2,3,4]
# or
list_var = [1 , 2 , 3 , 4]
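Putting the whitespace conventions from this section together, a short script might look like the sketch below. The function names and numbers are hypothetical and only there to show the spacing.

```python
def add_tax(price, tax_rate):
    return price * (1 + tax_rate)


def total_cost(prices, tax_rate):
    return sum(add_tax(price, tax_rate) for price in prices)


# Spaces around = in assignments, and one space after each comma
prices = [10.0, 20.0, 30.0]
tax_rate = 0.5
print(total_cost(prices, tax_rate))
```

Note the two blank lines after each function and the single space after every comma, both in the list and in the function arguments.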
5.3. Code Comments#
As mentioned previously, Python will ignore any text after a # on the same line. This allows you to write comments, text that is ignored by Python but can be read by other humans. Comments can be helpful for briefly describing what the subsequent code does: use them to provide extra contextual information that isn’t conveyed by function and variable names.
Actually, well-written code needs fewer comments because it’s more evident what’s going on. And it’s tempting not to update comments even when code changes. So do comment, but see if you can make the code tell its own story first.
Also, avoid “noise” comments that tell you what you already know from just looking at the code.
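The difference between a noise comment and a useful one is easiest to see in an example. In this sketch the payroll rule mentioned in the second comment is invented for illustration.

```python
HOURS_PER_WEEK = 40
weekly_wage = 600.0

# Noise: this comment only repeats what the code already says
# Divide the weekly wage by the hours per week
hourly_wage = weekly_wage / HOURS_PER_WEEK

# Useful: this comment gives context that the code cannot
# Round to 2 decimal places because the (hypothetical) payroll system
# rejects fractions of a penny
hourly_wage_rounded = round(hourly_wage, 2)

print(hourly_wage_rounded)
```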
Finally, functions come with their own special type of comment called a docstring. Here’s an example that tells us all about the function’s inputs and outputs, including the type of input and output (here a data frame, also known as pd.DataFrame).
import pandas as pd


def round_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Rounds numeric columns in dataframe to 2 s.f.

    Args:
        df (pd.DataFrame): Input dataframe

    Returns:
        pd.DataFrame: Dataframe with numbers rounded to 2 s.f.
    """
    for col in df.select_dtypes("number"):
        # Format each value to 2 significant figures, then convert back to float
        df[col] = df[col].apply(lambda x: float(f'{float(f"{x:.2g}"):g}'))
    return df
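A docstring is more than a comment: Python stores it on the function, which is how help() and many editors are able to display it. The sketch below uses a made-up real_wage function to show one way of reading a docstring back.

```python
import inspect


def real_wage(nominal_wage, price_index):
    """Deflate a nominal wage by a price index.

    Args:
        nominal_wage (float): Wage in current prices
        price_index (float): Price level, where 1.0 is the base period

    Returns:
        float: Wage in base-period prices
    """
    return nominal_wage / price_index


# inspect.getdoc returns the docstring with its indentation tidied up
docstring = inspect.getdoc(real_wage)
print(docstring.splitlines()[0])
```

Running help(real_wage) in an interactive session shows the same text alongside the function signature.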
5.4. Line width and line continuation#
For quite arbitrary historical reasons, PEP 8 suggests a maximum of 79 characters for each line of code. Some find this too restrictive, especially with the advent of wider monitors. But it is good to split up very long lines. Anything that is contained in parentheses can be split into multiple lines like so:
def function(input_one, input_two,
             input_three, input_four):
    # No commas between the terms: with commas this would build a tuple
    result = (input_one
              + input_two
              + input_three
              + input_four)
    return result
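The same trick works for square brackets, function calls, and long strings. Here is a brief sketch with invented data showing each case.

```python
# A long list, one item per line
quarterly_growth = [
    0.5,
    -0.25,
    0.75,
    1.0,
]

# A long function call, one argument per line
average_growth = round(
    sum(quarterly_growth) / len(quarterly_growth),
    2,
)

# A long string: adjacent string literals inside parentheses are joined
message = (
    "Average quarterly growth over the year was "
    f"{average_growth} per cent"
)

print(message)
```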
When using method chaining (something you can see in action in Data Transformation), a chain that is split across several lines must go inside parentheses, and it’s good practice to use a new line for every method. The code snippet below gives an example of what good looks like:
import pandas as pd

df = pd.DataFrame(
    data={
        "col0": [0, 0, 0, 0],
        "col1": [0, 0, 0, 0],
        "col2": [0, 0, 0, 0],
        "col3": ["a", "b", "b", "a"],
        "col4": ["alpha", "gamma", "gamma", "gamma"],
    },
    index=["row" + str(i) for i in range(4)],
)

# Chaining inside parentheses works
results = (
    df
    .groupby(["col3", "col4"])
    .agg({"col1": "count", "col2": "mean"})
)
results
col3 | col4 | col1 | col2 |
---|---|---|---|
a | alpha | 1 | 0.0 |
a | gamma | 1 | 0.0 |
b | gamma | 2 | 0.0 |
And this is what not to do:
results = df
.groupby(["col3", "col4"]).agg({"col1": "count", "col2": "mean"})
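The version without parentheses fails because Python treats the end of the first line as the end of the statement, so the next line begins with a stray dot. One way to check this without running the code is to ask Python to parse it. The sketch below does that for a pair of made-up snippets that chain string methods instead of pandas ones, so it needs no extra packages.

```python
import ast

# Hypothetical snippets: the same chain without and with parentheses
without_parentheses = 'result = "  some text  "\n.strip()\n.upper()\n'
with_parentheses = (
    'result = (\n    "  some text  "\n    .strip()\n    .upper()\n)\n'
)


def is_valid_python(source):
    """Return True if the source code parses, False on a syntax error."""
    try:
        ast.parse(source)
    except SyntaxError:
        return False
    return True


print(is_valid_python(without_parentheses))
print(is_valid_python(with_parentheses))
```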
5.5. Principles of Clean Code#
While automation can help apply style, it can’t help you write clean code. Clean code is a set of rules and principles that helps to keep your code readable, maintainable, and extendable. Writing code is easy; writing clean code is hard! However, if you follow these principles, you won’t go far wrong.
5.5.1. Do not repeat yourself (DRY)#
The DRY principle is ‘Every piece of knowledge or logic must have a single, unambiguous representation within a system.’ Divide your code into re-usable pieces that you can call when and where you want. Don’t write lengthy methods, but divide logic up into clearly differentiated chunks.
This saves you from repeating code and from having no idea whether it’s this or that version of the same function doing the work, and it will help your debugging efforts no end.
Some practical ways to apply DRY are to use functions, to put functions or code that needs to be executed multiple times by multiple different scripts into another script (eg called utilities.py) and then import it, and to think carefully if another way of writing your code would be more concise (yet still readable).
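As a minimal sketch of the first suggestion, compare the repeated logic below with the single function that replaces it. The data and the demean function are invented for this example.

```python
# Repetitive: the same logic is written out for each series
wages = [10.0, 20.0, 30.0]
wages_mean = sum(wages) / len(wages)
wages_demeaned = [value - wages_mean for value in wages]

hours = [30.0, 40.0, 50.0]
hours_mean = sum(hours) / len(hours)
hours_demeaned = [value - hours_mean for value in hours]


# DRY: the logic lives in one place and is re-used
def demean(values):
    """Subtract the mean of values from each value."""
    mean = sum(values) / len(values)
    return [value - mean for value in values]


print(demean(wages))
print(demean(hours))
```

If the calculation ever needs to change, the function version has exactly one place to edit.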
Tip
If you’re using Visual Studio Code, you can automatically send code into a function by right-clicking on code and using the ‘Extract to method’ option.
5.5.2. KISS (Keep It Simple, Stupid)#
Most systems work best if they are kept simple, rather than made complicated. This is a rule that says you should avoid unnecessary complexity. If your code is complex, it will only make it harder for you to understand what you did when you come back to it later.
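Here is a small, made-up illustration of the same task written in a needlessly complicated way and then in a simple way.

```python
ages = [17, 25, 42, 15, 68]

# Needlessly complicated: manual indexing, a counter, and a double negative
number_of_adults = 0
index = 0
while index < len(ages):
    if not ages[index] < 18:
        number_of_adults = number_of_adults + 1
    index = index + 1

# Simpler: the same count in one readable line
number_of_adults_simple = sum(age >= 18 for age in ages)

print(number_of_adults, number_of_adults_simple)
```

Both versions give the same answer, but the second is much easier to check at a glance.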
5.5.3. SoC (Separation of Concerns) / Make it Modular#
Do not have a single file that does everything. If you split your code into separate, independent modules it will be easier to read, debug, test, and use. You can check the basics of coding chapter to see how to create and import functions from other scripts. But even within a script, you can still make your code modular by defining functions that have clear inputs and outputs.
A good rule of thumb is that if a block of code that achieves one end runs longer than about 30 lines, it should probably go into a function. Scripts longer than about 500 lines are ripe for splitting up too.
Relatedly, do not have a single function that tries to do everything. Functions should have limits too; they should do approximately one thing. If you’re naming a function and you have to use ‘and’ in the name then it’s probably worth splitting it into two functions.
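For instance, a function with ‘and’ in its name can usually be split as in the sketch below, where all of the function names are hypothetical.

```python
# Doing two things: the 'and' in the name is the warning sign
def clean_and_summarise(values):
    cleaned = [value for value in values if value is not None]
    return sum(cleaned) / len(cleaned)


# One thing each: easier to test and to re-use separately
def drop_missing(values):
    return [value for value in values if value is not None]


def mean(values):
    return sum(values) / len(values)


raw_scores = [4.0, None, 6.0, 8.0]
print(mean(drop_missing(raw_scores)))
```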
Functions should have no ‘side effects’ either; that is, they should only take in value(s), and output value(s) via a return statement. They shouldn’t modify global variables or make other changes.
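The following sketch contrasts a function that has a side effect with one that does not. The sales example and both function names are made up.

```python
# Side effect: the function changes a global variable
total_sales = 0.0


def record_sale_with_side_effect(amount):
    global total_sales
    total_sales = total_sales + amount


# No side effects: values in, value out via a return statement
def add_sale(current_total, amount):
    return current_total + amount


record_sale_with_side_effect(10.0)  # silently changes total_sales
new_total = add_sale(total_sales, 5.0)  # the change is explicit
print(total_sales, new_total)
```

With the second function, everything it depends on and everything it produces is visible at the point where it is called.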
Another good rule of thumb is that each function shouldn’t have lots of separate arguments.
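One common way to cut down on arguments, though by no means the only one, is to group related values into a single object. The sketch below uses a dataclass from the standard library; the Worker class and its fields are assumptions made for illustration.

```python
from dataclasses import dataclass


# Many separate arguments: easy to pass them in the wrong order
def describe_worker_long(first_name, last_name, age, hours_worked, hourly_wage):
    weekly_pay = hours_worked * hourly_wage
    return f"{first_name} {last_name} ({age}) earns {weekly_pay} a week"


# Related values grouped into one object (a hypothetical Worker class)
@dataclass
class Worker:
    first_name: str
    last_name: str
    age: int
    hours_worked: float
    hourly_wage: float


def describe_worker(worker):
    weekly_pay = worker.hours_worked * worker.hourly_wage
    name = f"{worker.first_name} {worker.last_name}"
    return f"{name} ({worker.age}) earns {weekly_pay} a week"


worker = Worker("Ada", "Lovelace", 36, 40.0, 12.5)
print(describe_worker(worker))
```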
A final tip for modularity and the creation of functions is that you shouldn’t use ‘flags’ in functions (aka boolean arguments that switch between behaviours). Here’s an example:
# This is bad
def transform(text, uppercase):
    if uppercase:
        return text.upper()
    else:
        return text.lower()


# This is good
def uppercase(text):
    return text.upper()


def lowercase(text):
    return text.lower()