9. Workflow: Packages and Environments
In this chapter, you’re going to learn about packages and how to install them, plus virtual coding environments that keep your packages isolated and your projects reproducible.
9.1. Packages
9.1.1. Introduction
Packages (also called libraries) are key to extending the functionality of Python. The default installation of Anaconda comes with many (around 250) of the packages you’ll need, but it won’t be long before you’ll need to install some extra ones. There are packages for geoscience, for building websites, for analysing genetic data, for economics: pretty much for anything you can think of. Packages are typically not written by the core maintainers of the Python language but by enthusiasts, firms, researchers, academics, all sorts! Because anyone can write packages, they vary widely in their quality and usefulness. There are some that you’ll be seeing again and again.
Name a more iconic trio, I'll wait. pic.twitter.com/pGaLuUxQ3r
— Vicki Boykis (@vboykis) August 23, 2018
The three Python packages numpy, pandas, and matplotlib, which respectively provide numerical, data analysis, and plotting functionality, are ubiquitous. So many scripts begin by importing all three of them, as in the tweet above!
There are typically two steps to using a new Python package:
install the package on the command line (aka the terminal), eg using
pip install pandas
import the package into your Python session, eg using
import pandas as pd
When you issue an install command for a specific package, it is automatically downloaded from the internet and installed in the appropriate place on your computer. To install extra Python packages, you issue install commands to a text-based window called the “terminal”.
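As a minimal sketch of how the two steps fit together, the snippet below attempts an import and, if the package is missing, prints a reminder of the install command to run in the terminal. The helper name try_import and the deliberately non-existent package name are made up for this illustration; importlib.import_module is part of Python's standard library.

```python
import importlib


def try_import(package_name):
    """Return the imported module, or None (with a hint) if it is not installed."""
    try:
        return importlib.import_module(package_name)
    except ModuleNotFoundError:
        print(f"{package_name} is not installed; try: pip install {package_name}")
        return None


# 'json' ships with Python, so this import always succeeds
json_module = try_import("json")

# A made-up name that should not exist, to show the hint being printed
missing = try_import("a_package_that_does_not_exist_xyz")
```

Bear in mind that the name used on the command line and the name used in the import statement are not always identical: scikit-learn, for instance, is installed as scikit-learn but imported as sklearn, so the hint printed by this sketch is only a first guess.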
9.1.2. The Command Line in Brief
The terminal (also called the command line or, sometimes, the command prompt) was labelled 4 in the screenshot of Visual Studio Code from the chapter on First Steps. It is a text-based way to issue all kinds of commands to your computer (not just Python commands). Knowing a little bit about it is really useful for coding (and more) because managing packages, environments (which we haven’t yet discussed), and version control (ditto) can all be done via the terminal. We’ll come to these in due course in the chapter on The Command Line, but for now, here is a little background on what the terminal is and what it does.
Note
To open up the command line within Visual Studio Code, use the ⌃ + ` keyboard shortcut (Mac) or ctrl + ` (Windows/Linux), or click “View > Terminal”.
Windows users may find it easiest to use the Anaconda Prompt as their terminal, at least for installing Python packages.
If you want to open up the command line independently of Visual Studio Code, search for “Terminal” on Mac and Linux, and “Anaconda Prompt” on Windows.
Everything you can do by clicking on icons to launch programmes on your computer, you can also do via the terminal. For many programmes, a lot of their functionality can be accessed using the command line, and other programmes only have a command line interface (CLI), including some that are used for data science.
Tip
The command line interacts with your operating system and is used to create, activate, or change Python installations.
Use Visual Studio Code to open a terminal window by clicking Terminal -> New Terminal on the list of commands at the very top of the window. If you have installed the Anaconda distribution of Python, your terminal should look something like this as your ‘command prompt’:
(base) your-username@your-computer current-directory %
on Mac, and the same but with ‘%’ replaced by ‘$’ on Linux, and (using the Anaconda Prompt)
(base) C:\Users\YourUsername>
on Windows. If you don’t see the word (base) at the start of the line, you may need to type conda activate first.
The (base) part is saying that your current Python environment is the base one (later, we’ll see how to add others for reproducibility and to isolate projects). Unfortunately, and confusingly, the commands that you can use in the terminal differ between Mac and Linux, on the one hand, and Windows, on the other, but many of the principles are the same.
For now, to at least try out the command line, let’s use something that works across all three of the major operating systems. Type python on the command prompt that came up in your new terminal window. You should see information about your installation of Python appear, including the version, followed by a Python prompt that looks like >>>. This is a kind of interactive Python session, in the terminal. It’s much less rich than the one available in Visual Studio Code (it can’t run scripts line-by-line, for example) but you can try print('Hello World!') and it will run, printing your message. To exit the terminal-based Python session, type exit() to go back to the regular command line.
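The version information shown when the interactive session starts is also available from inside Python. Here is a small sketch, using only the standard library, of something else you could type at the >>> prompt; the variable name version is just for illustration.

```python
import platform
import sys

# The version of Python that is running this session, as a string
version = platform.python_version()
print("Python version:", version)

# sys.version_info supports comparisons, for instance checking for Python 3
print("Running Python 3 or later:", sys.version_info >= (3,))
```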
9.1.3. Installing Packages
To install extra Python packages, the default and easiest way is to use pip install packagename. In true programming-humour style, pip is a recursive acronym that stands for ‘pip installs packages’. There are over 330,000 Python packages on PyPI (the Python Package Index)! You can see what packages you have installed already by typing conda list into the command line.
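You can also ask about installed packages from within Python itself. The sketch below uses the standard library's importlib.metadata module to look up the installed version of a package by name; the helper installed_version and the non-existent package name are hypothetical and only there for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# The first two may or may not be installed; the last one should not be
for name in ["pandas", "numpy", "a-package-that-does-not-exist-xyz"]:
    print(name, "->", installed_version(name))
```

What this prints depends on which packages are present in the Python environment that runs it.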
pip install ... will install packages into your default Anaconda environment, usually called “base”. You’ll need to have Anaconda “activated” before installing a package in the terminal: if you don’t see the name of an environment, eg (base), at the start of your terminal’s line, use the conda activate command first. On Windows, the terminal to use is usually the command prompt (available in the integrated Visual Studio Code terminal) or the Anaconda Prompt (available in the start menu).
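If you are unsure whether an environment is active, one rough check from Python is to look at the CONDA_DEFAULT_ENV environment variable, which conda normally sets when an environment is activated. This is a sketch under that assumption; the function name active_conda_env is made up for the example.

```python
import os


def active_conda_env(environ=os.environ):
    """Return the name of the active conda environment, or None if there isn't one."""
    # Assumption: conda sets CONDA_DEFAULT_ENV on activation
    return environ.get("CONDA_DEFAULT_ENV")


env_name = active_conda_env()
if env_name is None:
    print("No conda environment appears to be active; try: conda activate")
else:
    print(f"Active conda environment: {env_name}")
```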
There is a second way to install packages that directly uses conda instead of pip, which we’ll come to shortly in the context of different Python environments.
Here’s a full example of the commands used to install the pandas package into the base environment (you may not need the first one):
your-username@your-computer current-directory % conda activate
(base) your-username@your-computer current-directory % pip install pandas
Exercise
Try installing the matplotlib, pandas, statsmodels, and skimpy packages using pip install.
9.1.4. Using Packages
Once you have installed a package, you need to be able to use it! This is usually done via an import statement at the top of your script or Jupyter Notebook. For example, to bring in pandas, it’s
import pandas as pd
Why does Python do this? The idea of not just loading every package is to provide clarity over what function is being called from what package. It’s also not necessary to load every package for every piece of analysis, and you often actually want to know what the minimum set of packages is to reproduce an analysis. Making the package imports explicit helps with all of that.
You may also wonder why one doesn’t just use import pandas as pandas. There’s actually nothing stopping you doing this except i) it’s convenient to have a shorter name and ii) there does tend to be a convention around imports, ie pd for pandas and np for numpy, and your code will be clearer to yourself and others if you follow the conventions.
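To see that an alias is nothing more than a second name for the same package, here is a short sketch using the statistics module, chosen only because it ships with Python and so needs no installation. The alias st is arbitrary and not an established convention.

```python
import statistics
import statistics as st  # the same module, bound to a shorter name

# Both names point at one and the same module object
print(st is statistics)

# So functions can be called through either name
values = [2, 4, 6]
print(st.mean(values))
print(statistics.mean(values))
```

The same logic applies to import pandas as pd: pd is simply a shorter name for the pandas package.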
9.2. Virtual Code Environments
Virtual code environments allow you to isolate all of the packages that you’re using to do analysis for one project from the set of packages you might need for a different project. They’re an important part of creating a reproducible analytical pipeline: a key benefit is that others can reproduce the environment you used. It’s best practice to have an isolated environment per project.
To be more concrete, let’s say you’re using Python 3.9, statsmodels, and pandas for one project, project A. And, for project B, you need to use Python 3.10 with numpy and scikit-learn. Even with the same version of Python, best practice would be to have two separate virtual Python environments: environment A, with everything needed for project A, and environment B, with everything needed for project B. For the case where you’re using different versions of Python, this isn’t just best practice, it’s essential.
Many programming languages now come with an option to install packages and a version of the language in isolated environments. In Python, there are multiple tools for managing different environments. And, of those, the easiest to work with is probably Anaconda (conda for short).
9.2.1. Conda as a Package Manager
To learn how to use virtual code environments, we need to make a brief detour into conda as a package manager.
Conda does more than just provide a Python interpreter: it can also manage packages and different Python installations, aka Python environments. So you can also install packages with
conda install package-name -c conda-forge
This will try to install a version of the package that is already optimised for your type of computer, and will automatically come with any dependencies (packages the package you’re installing needs to run). The pre-built packages that are provided by Anaconda are convenient for a host of reasons. Anaconda provide pre-built versions of around 7,500 of the most popular packages (including the statistical programming language R). This is far fewer than PyPI but what you tend to find in practice is that the conda-forge channel (-c selects the “channel” to install from) has most of what you need.
Okay, so how does this help with creating virtual environments? Because conda install ... is able to select versions of packages that work well with your computer, it’s good at finding a combination of packages that will work well together without issue. With so many packages on PyPI, not all versions of all packages work together! Conda aims to solve that problem, and it makes working with virtual environments much nicer. But you can still pip install ... in a specific conda environment when you need to (eg because that particular package isn’t available on conda-forge).
You can see all of the packages in your (currently activated) conda environment by running conda list on the command line. This book uses a conda environment, and here’s an example of looking at the installed packages within it, filtering them just to the ones beginning with “s”. You can see which packages are from conda-forge and which are from PyPI.
conda list | grep ^s
scikit-learn 1.3.2 py310h417b086_1 conda-forge
scipy 1.11.3 py310hd1cfc7d_1 conda-forge
seaborn 0.12.2 pypi_0 pypi
send2trash 1.8.2 pyhd1c38e8_0 conda-forge
setuptools 68.2.2 pyhd8ed1ab_0 conda-forge
shapely 2.0.2 py310h656ff59_0 conda-forge
six 1.16.0 pyh6c4a22f_0 conda-forge
skimpy 0.0.11 pypi_0 pypi
snappy 1.1.10 h17c5cce_0 conda-forge
sniffio 1.3.0 pyhd8ed1ab_0 conda-forge
snowballstemmer 2.2.0 pyhd8ed1ab_0 conda-forge
soupsieve 2.5 pyhd8ed1ab_1 conda-forge
sphinx 5.0.2 pyh6c4a22f_0 conda-forge
sphinx-book-theme 1.0.1 pyhd8ed1ab_0 conda-forge
sphinx-comments 0.0.3 pyh9f0ad1d_0 conda-forge
sphinx-copybutton 0.5.2 pyhd8ed1ab_0 conda-forge
sphinx-design 0.3.0 pyhd8ed1ab_0 conda-forge
sphinx-external-toc 0.3.1 pyhd8ed1ab_1 conda-forge
sphinx-jupyterbook-latex 0.5.2 pyhd8ed1ab_0 conda-forge
sphinx-multitoc-numbering 0.1.3 pyhd8ed1ab_0 conda-forge
sphinx-thebe 0.2.1 pyhd8ed1ab_0 conda-forge
sphinx-togglebutton 0.3.2 pyhd8ed1ab_0 conda-forge
sphinxcontrib-applehelp 1.0.7 pyhd8ed1ab_0 conda-forge
sphinxcontrib-bibtex 2.5.0 pyhd8ed1ab_0 conda-forge
sphinxcontrib-devhelp 1.0.5 pyhd8ed1ab_0 conda-forge
sphinxcontrib-htmlhelp 2.0.4 pyhd8ed1ab_0 conda-forge
sphinxcontrib-jsmath 1.0.1 pyhd8ed1ab_0 conda-forge
sphinxcontrib-qthelp 1.0.6 pyhd8ed1ab_0 conda-forge
sphinxcontrib-serializinghtml 1.1.9 pyhd8ed1ab_0 conda-forge
sqlalchemy 1.4.49 pypi_0 pypi
sqlalchemy2-stubs 0.0.2a35 pypi_0 pypi
sqlglot 18.17.0 pypi_0 pypi
sqlite 3.43.2 hf2abe2d_0 conda-forge
sqlmodel 0.0.11 pypi_0 pypi
stack_data 0.6.2 pyhd8ed1ab_0 conda-forge
statsmodels 0.14.0 py310h50ce23c_2 conda-forge
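The four columns in that listing are the package name, version, build, and channel. As a rough sketch of how you might separate conda-forge packages from PyPI ones, the code below groups a few lines copied from the listing by their final column. It assumes every line has exactly four whitespace-separated fields, which holds for the lines shown here but may not hold for every line of real conda list output.

```python
from collections import defaultdict

# A few lines copied from the listing above: name, version, build, channel
listing = """\
scipy 1.11.3 py310hd1cfc7d_1 conda-forge
seaborn 0.12.2 pypi_0 pypi
skimpy 0.0.11 pypi_0 pypi
statsmodels 0.14.0 py310h50ce23c_2 conda-forge
"""

packages_by_channel = defaultdict(list)
for line in listing.splitlines():
    # Assumes exactly four fields per line, as in the listing above
    name, version, build, channel = line.split()
    packages_by_channel[channel].append(name)

for channel, names in packages_by_channel.items():
    print(f"{channel}: {', '.join(names)}")
```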
9.2.2. Using Anaconda to Manage Python Environments
Okay, we’re now ready to look at using conda to manage Python environments. Much of these two subsections is covered by the Anaconda documentation on managing virtual environments.
If you’re using Anaconda, you manage and change environments on the command line (remember, there’s much more on the command line in The Command Line). Before following these instructions, check that you have Anaconda installed and activated. You should see something like (base) username@computername:~$ on the command line (base is the default conda environment).
To create a new environment called “myenv” with a specific version of Python (but no extra packages installed), it’s
conda create -n myenv python=3.8
where you can of course specify other versions of Python by changing the number. To throw in a package or two, just add them to the end, for example
conda create -n myenv python=3.8 pandas jupyter
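To make the anatomy of that command explicit, here is a sketch that assembles it piece by piece in Python. It only builds and prints the command text and does not run conda; the function name conda_create_command is hypothetical.

```python
import shlex


def conda_create_command(env_name, python_version, packages=()):
    """Build the 'conda create' command as a list of arguments (does not run it)."""
    return ["conda", "create", "-n", env_name, f"python={python_version}", *packages]


# Environment name, then the Python version, then any extra packages
command = conda_create_command("myenv", "3.8", ["pandas", "jupyter"])
print(shlex.join(command))
```

The printed text is what you would type in the terminal: the name after -n, the Python version, and then any packages added to the end.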
You can see a list of the currently installed environments by running
conda env list
When you install Anaconda, you will begin with a “base” environment. As noted, it’s best practice not to use this for projects but instead to create a new environment for each project.
There are two downsides to creating environments directly from the command line. One is that you may have lots of packages, which makes for a long command. The second is that you may wish to keep a record of the environment you created! Keeping a record is really good practice, because it helps you to make your work more reproducible. For both of these reasons, you can specify a conda environment using a file.
A very simple environment that just had pandas and an interactive console would look like this in a file:
name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.9
  - pandas
  - jupyter
The environment’s name is given by name, the channel (where to look for the packages) by channels, and the specific packages by dependencies. Not all packages are available on conda’s default channels, so sometimes extra channels are needed. By specifying conda-forge we get the widest possible selection of packages. But, as we noted before, some packages are only available on PyPI (pip); these can be specified with a sub-section of the file like so for the skimpy package:
name: myenv
channels:
  - conda-forge
dependencies:
  - python=3.8
  - pandas
  - jupyter
  - pip:
    - skimpy
This goes into a file called environment.yml, and the environment it describes can be created by running
conda env create -f environment.yml
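The nesting in the file matters: packages installed with pip sit one level deeper, under a pip entry inside dependencies. As an illustration of that structure only, here is a hand-rolled sketch that generates the text of such a file from Python lists. The function render_environment is hypothetical, uses plain string formatting rather than a YAML library, and is not how conda itself reads or writes these files.

```python
def render_environment(name, channels, dependencies, pip_dependencies=()):
    """Return the text of a conda environment file (a sketch, no YAML library)."""
    lines = [f"name: {name}", "channels:"]
    lines += [f"  - {channel}" for channel in channels]
    lines.append("dependencies:")
    lines += [f"  - {dep}" for dep in dependencies]
    if pip_dependencies:
        # pip-only packages are nested one level deeper, under a 'pip:' entry
        lines.append("  - pip:")
        lines += [f"    - {dep}" for dep in pip_dependencies]
    return "\n".join(lines) + "\n"


text = render_environment(
    name="myenv",
    channels=["conda-forge"],
    dependencies=["python=3.8", "pandas", "jupyter"],
    pip_dependencies=["skimpy"],
)
print(text)
```

Saving the printed text as environment.yml would give a file with the same structure as the example above.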
This book is put together using an isolated conda environment specified in a file. It’s an unusually big one because there are a lot of packages featured in the book! Here they are:
from rich import print

# Read the environment file as plain text and display it
with open("environment.yml", "r") as stream:
    data_loaded = stream.read()
print(data_loaded)
name: py4ds2e
channels:
  - conda-forge
dependencies:
  - jupyter
  - numpy
  - pandas
  - pip
  - python=3.10
  - pyyaml
  - scipy
  - statsmodels
  - yaml
  - pycodestyle
  - autopep8
  - pandas-datareader
  - pandasdmx
  - jupyter-book
  - pytest
  - pre-commit
  - jupyterlab
  - nbstripout
  - ghp-import
  - pip
  - black
  - black-jupyter
  - beautifulsoup4
  - geopandas
  - pip:
    - skimpy
    - pyarrow
    - watermark
    - graphviz
    - openpyxl
    - sqlmodel
    - ibis-framework
    - lets-plot
    - polars
    - palmerpenguins
    - pandas-profiling
    - rich
Of course, you can install packages as you go too; you don’t have to specify them when you create the environment. With the relevant environment activated, use conda install packagename to do this.
Finally, to remove an environment, it’s
conda remove --name myenv --all
9.2.3. Using and Switching Between Conda Environments
To switch between conda environments on the command line, for example from the base environment to an environment called “myenv”, use
conda activate myenv
However, this only switches the environment for code that you run on the command line!
Fortunately, Visual Studio Code has you covered and makes it very easy to switch Python environments for a project at the click of a button.
In the screenshot above, you can see the project-environment in two places: on the blue bar at the bottom of the screen, and (in 5), at the top right hand side of the interactive window. Click on either to change the Python environment that will be used to execute code. A similar top right selector is present for Jupyter Notebooks too.
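Whichever way you select the environment, you can confirm from within Python which interpreter is actually executing your code. This is a small sketch using the standard library; the remark about the folder name is an assumption about how conda typically lays out named environments, not a guarantee.

```python
import sys
from pathlib import Path

# The Python interpreter that is executing this code
print("Interpreter:", sys.executable)

# The root folder of the environment that interpreter belongs to;
# for a conda environment called "myenv" this path typically ends in "myenv"
environment_root = Path(sys.prefix)
print("Environment folder:", environment_root.name)
```

Running this in the interactive window and again in the terminal is a quick way to spot when the two are using different environments.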