10. Postscript: Getting Further Help

This book is not an island; there is no single resource that will allow you to master Python for Data Science. As you begin to apply the techniques described in this book to your own data, you will soon find questions that we do not answer. This section offers a few tips on how to get help and how to keep learning.

10.1. Resources

Some other resources for learning are:

10.2. Google is your friend

If you get stuck, start with Google. Typically adding “Python” or “Python Data Science” (as the Python ecosystem goes well beyond data science) to a query is enough to restrict it to relevant results. Google is particularly useful for error messages. If you get an error message and you have no idea what it means, try googling it! Chances are that someone else has been confused by it in the past, and there will be help somewhere on the web.

If Google doesn’t help, try Stack Overflow. Start by spending a little time searching for an existing answer, including [Python] to restrict your search to questions and answers that use Python.
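When searching for an error, the most useful query is usually the final line of the traceback (the exception type and its message), without the file paths that are specific to your machine. A minimal sketch of pulling that line out with the standard library's traceback module follows; the failing expression is only an illustration, not something from this book:

```python
import traceback

try:
    # Illustrative failure: adding an int to a str raises TypeError
    result = 1 + "a"
except TypeError as err:
    # format_exception_only returns just the "ExceptionType: message" line(s),
    # leaving out the stack frames and local file paths
    search_text = "".join(traceback.format_exception_only(type(err), err)).strip()

# Paste this text into a search engine, adding "Python" to narrow the results
print(search_text)
```

In practice you can simply copy the last line of the error by hand; the point is that this line, rather than the whole traceback, is what other people are likely to have posted.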

10.3. In the loop

It’s also helpful to keep an eye on the latest developments in data science. There are many data science newsletters out there, and we recommend keeping up with the Python data science community by following the #pydata, #datascience, and #python hashtags on Twitter.

10.4. Making a reprex (reproducible example)

If your googling doesn’t find anything useful, it’s a really good idea to prepare a minimal reproducible example, or reprex.

A good reprex makes it easier for other people to help you, and often you’ll figure out the problem yourself in the course of making it. There are two parts to creating a reprex:

  • First, you need to make your code reproducible. This means that you need to capture everything, i.e., include any packages you used and create all necessary objects. The easiest way to make sure you’ve done this is to use the watermark package alongside whatever else you are doing:

import pandas as pd
import numpy as np
from watermark import watermark

print(watermark(iversions=True, globals_=globals()))
Last updated: 2024-01-16T10:34:33.118404+00:00

Python implementation: CPython
Python version       : 3.10.13
IPython version      : 8.16.1

Compiler    : Clang 16.0.6 
OS          : Darwin
Release     : 23.2.0
Machine     : arm64
Processor   : i386
CPU cores   : 10
Architecture: 64bit

numpy : 1.25.2
pandas: 2.0.3
  • Second, you need to make it minimal. Strip away everything that is not directly related to your problem. This usually involves creating a much smaller and simpler Python object than the one you’re facing in real life or even using built-in data.

That sounds like a lot of work! And it can be, but it has a great payoff:

  • 80% of the time creating an excellent reprex reveals the source of your problem. It’s amazing how often the process of writing up a self-contained and minimal example allows you to answer your own question.

  • The other 20% of the time you will have captured the essence of your problem in a way that is easy for others to play with. This substantially improves your chances of getting help.
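One way to make an example minimal is to cut your real data down to the few rows and columns that still trigger the problem, then turn that slice into a literal that others can paste. The sketch below assumes a hypothetical large DataFrame called big_df; the column names are invented for illustration:

```python
import pandas as pd

# Hypothetical stand-in for a large real-world dataset
big_df = pd.DataFrame({"id": range(1000), "value": [i * 0.5 for i in range(1000)]})

# Keep only the rows and columns needed to reproduce the problem
small_df = big_df.head(3)[["id", "value"]]

# to_dict("list") produces a plain dictionary of lists, which can be
# pasted straight into a pd.DataFrame(...) call in your reprex
snippet = f"pd.DataFrame({small_df.to_dict('list')})"
print(snippet)
```

This works best for a handful of rows with simple column types; dates and categorical columns may need to be converted back by hand after pasting.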

There are several things you need to include to make your example reproducible: Python environment, required packages, data, and code.

  • Python environment: really just the Python version. This is covered by the call to watermark shown above.

  • Packages and their versions. These should be loaded at the top of the script, so it’s easy to see which ones the example needs. By using watermark with the above configuration, you will also print the package versions. This is a good time to check that you’re using the latest version of each package; it’s possible you’ve discovered a bug that’s been fixed since you installed or last updated the package.

  • Data: as others won’t be able to easily download the data you’re working with, it’s often best to create a small amount of data from code that still has the same problem as you’re finding with your actual data. Between numpy and pandas, it’s quite easy to generate data from code; here’s an example:

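df = pd.DataFrame(
    data=np.reshape(range(36), (6, 6)),
    index=["a", "b", "c", "d", "e", "f"],
    columns=["col" + str(i) for i in range(6)],
    dtype=float,  # assumed: the output below shows float columns
)
# Add a column of draws from a standard normal distribution
df["random_normal"] = np.random.normal(size=6)
df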
   col0  col1  col2  col3  col4  col5  random_normal
a   0.0   1.0   2.0   3.0   4.0   5.0      -0.322713
b   6.0   7.0   8.0   9.0  10.0  11.0      -0.254159
c  12.0  13.0  14.0  15.0  16.0  17.0       0.010210
d  18.0  19.0  20.0  21.0  22.0  23.0      -0.857370
e  24.0  25.0  26.0  27.0  28.0  29.0      -1.232134
f  30.0  31.0  32.0  33.0  34.0  35.0      -1.115775
  • Code: copy and paste the minimal reproducible example code (including the packages, as noted above). Make sure you’ve used spaces and your variable names are concise, yet informative. Use comments to indicate where your problem lies. Do your best to remove everything that is not related to the problem. Finally, the shorter your code is, the easier it is to understand, and the easier it is to fix.
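The data example above draws random numbers without a seed, so the random_normal column will differ on every run. If your problem depends on particular values, one option is a seeded generator, so that anyone running the reprex sees the same numbers. This is a sketch rather than the only approach, and the seed value 42 is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same draws on every run; 42 is arbitrary
rng = np.random.default_rng(42)

df = pd.DataFrame(
    data=np.arange(36, dtype=float).reshape(6, 6),
    index=list("abcdef"),
    columns=[f"col{i}" for i in range(6)],
)
# Same idea as the example above, but the draws are now repeatable
df["random_normal"] = rng.normal(size=6)
print(df)
```

Note that the exact numbers depend on the numpy version as well as the seed, which is another reason to include the package versions in your reprex.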

Finish by checking that you have actually made a reproducible example by starting a fresh Python session and copying and pasting your reprex in.
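If you prefer to automate that final check, one possibility is to write the reprex to a file and run it in a new interpreter process, which shares no variables or imports with your current session. The sketch below uses only the standard library; the reprex_source string is a stand-in that you would replace with your own example:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in reprex; replace this string with your own pasted example
reprex_source = """
import statistics

values = [1, 2, 3, 4]
print(statistics.mean(values))
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    script_path = Path(tmp_dir) / "reprex.py"
    script_path.write_text(reprex_source)
    # A new interpreter process starts with none of this session's objects
    completed = subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,
        text=True,
    )

# A non-zero exit code means the script failed when run on its own
print("exit code:", completed.returncode)
print(completed.stdout or completed.stderr)
```

A NameError or ImportError in the captured output usually points to an object or package that the reprex forgot to define or import. The new process still uses the packages installed in your current environment, so this does not check that other people have the same versions.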