{
"cells": [
{
"cell_type": "markdown",
"id": "95f0a171",
"metadata": {},
"source": [
"(numbers)=\n",
"# Numbers\n",
"\n",
"## Introduction\n",
"\n",
"In this chapter, you'll learn useful tools for creating and manipulating numeric vectors. We'll start by going into a little more detail of `.count()` before diving into various numeric transformations. You'll then learn about more general transformations that can be applied to other types of column, but are often used with numeric column. Then you'll learn about a few more useful summaries.\n",
"\n",
"### Prerequisites\n",
"\n",
"This chapter mostly uses functions from **pandas**, which you are likely to already have installed bu you can install using `pip install pandas` in the terminal. We'll use real examples from nycflights13, as well as toy examples made with fake data.\n",
"\n",
"Let's first load up the NYC flights data\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "51a55374",
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"import matplotlib_inline.backend_inline\n",
"import matplotlib.pyplot as plt\n",
"\n",
"# Plot settings\n",
"plt.style.use(\"https://github.com/aeturrell/python4DS/raw/main/plot_style.txt\")\n",
"matplotlib_inline.backend_inline.set_matplotlib_formats(\"svg\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6c89ca3d",
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"url = \"https://raw.githubusercontent.com/byuidatascience/data4python4ds/master/data-raw/flights/flights.csv\"\n",
"flights = pd.read_csv(url)"
]
},
{
"cell_type": "markdown",
"id": "abd1fd34",
"metadata": {},
"source": [
"### Counts\n",
"\n",
"It's surprising how much data science you can do with just counts and a little basic arithmetic, so **pandas** strives to make counting as easy as possible with `.count()` and `.value_counts()`. The former just provides a straight count of all the non NA items:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18f1ee4c",
"metadata": {},
"outputs": [],
"source": [
"flights[\"dest\"].count()"
]
},
{
"cell_type": "markdown",
"id": "80e8140f",
"metadata": {},
"source": [
"The latter provides a count broken down by type:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "161a24ea",
"metadata": {},
"outputs": [],
"source": [
"flights[\"dest\"].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "cdb110f3",
"metadata": {},
"source": [
"This is automatically sorted in order of the most common category. You can perform the same computation \"by hand\" with `group_by()`, `agg()` and then using the count function. This is useful because it allows you to compute other summaries at the same time:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8554277a",
"metadata": {},
"outputs": [],
"source": [
"(\n",
" flights.groupby([\"dest\"])\n",
" .agg(\n",
" mean_delay=(\"dep_delay\", \"mean\"),\n",
" count_flights=(\"dest\", \"count\"),\n",
" )\n",
" .sort_values(by=\"count_flights\", ascending=False)\n",
")"
]
},
{
"cell_type": "markdown",
"id": "9e2b23fa",
"metadata": {},
"source": [
"Note that a weighted count is just a sum. For example you could \"count\" the number of miles each plane flew:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "061decae",
"metadata": {},
"outputs": [],
"source": [
"(flights.groupby(\"tailnum\").agg(miles=(\"distance\", \"sum\")))"
]
},
{
"cell_type": "markdown",
"id": "1b489765",
"metadata": {},
"source": [
"You can count missing values by combining `sum()` and `isnull()`. In the flights dataset this represents flights that are cancelled. Note that because there isn't a simple string name for applying `.isnull()` followed by `.sum()` (unlike just running `sum()`, which would be given by the string \"sum\"), we need to use a lambda function in the below:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ecdb5630",
"metadata": {},
"outputs": [],
"source": [
"(flights.groupby(\"dest\").agg(n_cancelled=(\"dep_time\", lambda x: x.isnull().sum())))"
]
},
{
"cell_type": "markdown",
"id": "8587dbe2",
"metadata": {},
"source": [
"## Numeric Transformations\n",
"\n",
"Transformation functions have an output is the same length as the input. The vast majority of transformation functions are either built into Python or come with the numerical package **numpy**. It's impractical to list all the possible numerical transformations so this section will show the most useful ones.\n",
"\n",
"Basic number arithmatic is achieved by `+` (addition), `-` (subtraction), `*` (multiplication), `/` (division), `**` (powers), `%` (modulo), and `@` (tensor product). Most of these functions don't need a huge amount of explanation because you'll be familiar with them already (and you can look up the others when you do need them).\n",
"\n",
"When you have two numeric columns of equal length and you add or subtract them, it's pretty obvious what's going to happen. But we do need to talk about what happens when there is a variable involved that is *not* as long as the column. This is important for operations like `flights.assign(air_time = air_time / 60)` because there are 336,776 numbers on the left of `/` but only one on the right. In this case, **pandas** will understand that you'd like to divide *all* values of air time by 60. This is sometimes called 'broadcasting'. Below is a digram that tries to explain what's going on:\n",
"\n",
"![](https://numpy.org/doc/stable/_images/broadcasting_1.png)\n",
"\n",
"You can find out much more about [broadcasting on the **numpy** documentation](https://numpy.org/doc/stable/user/basics.broadcasting.html). **pandas** is built on top of **numpy** and inherits some of its functionality. \n"
]
},
{
"cell_type": "markdown",
"id": "adb10890",
"metadata": {},
"source": [
"When operating on two columns, **pandas** compares their shapes element-wise. Two columns are compatible when they are equal, or one of them is a scalar. If these conditions are not met, you will get an error.\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "e2e62aed",
"metadata": {},
"source": [
"### Minimum and Maximum\n",
"\n",
"The arithmetic functions do what you'd expect."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a473cd56",
"metadata": {},
"outputs": [],
"source": [
"flights[\"distance\"].max()"
]
},
{
"cell_type": "markdown",
"id": "dd24faa5",
"metadata": {},
"source": [
"Sometimes, you'd like to look at the maximum or minimum value across rows *or* columns. As often is the case with **pandas**, you can specify rows or columns to apply functions to by passing `axis=0` (index) or `axis=1` (columns) to that function. The axis designation can be confusing: remember that you are asking which dimension you wish to aggregate over, leaving you with the other dimension. So if we wish to find the minimum in each row, we aggregate / collapse columns, so we need to pass `axis=1`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96285702",
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame({\"x\": [1, 5, 7], \"y\": [3, 2, pd.NA]})\n",
"df"
]
},
{
"cell_type": "markdown",
"id": "8aabe248",
"metadata": {},
"source": [
"Now let's find the min by row:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5bae5499",
"metadata": {},
"outputs": [],
"source": [
"df.min(axis=1)"
]
},
{
"cell_type": "markdown",
"id": "b63b455a",
"metadata": {},
"source": [
"### Modular arithmetic\n",
"\n",
"Modular arithmetic is the technical name for the type of math you do on whole numbers, i.e. division that yields a whole number and a remainder. In Python, `//` does integer division and `%` computes the remainder:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3af6ce86",
"metadata": {},
"outputs": [],
"source": [
"print([x for x in range(1, 11)])\n",
"print(\"divided by 3 gives\")\n",
"print(\"remainder:\")\n",
"print([x % 3 for x in range(1, 11)])\n",
"print(\"divisions:\")\n",
"print([x // 3 for x in range(1, 11)])"
]
},
{
"cell_type": "markdown",
"id": "449a7811",
"metadata": {},
"source": [
"Modular arithmetic is handy for the flights dataset, because we can use it to unpack the `sched_dep_time` variable into and `hour` and `minute`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6be57e34",
"metadata": {},
"outputs": [],
"source": [
"flights.assign(\n",
" hour=lambda x: x[\"sched_dep_time\"] // 100,\n",
" minute=lambda x: x[\"sched_dep_time\"] % 100,\n",
")"
]
},
{
"cell_type": "markdown",
"id": "1fbfc9c7",
"metadata": {},
"source": [
"### Logarithms\n",
"\n",
"Logarithms are an incredibly useful transformation for dealing with data that ranges across multiple orders of magnitude. They also convert exponential growth to linear growth. For example, take compounding interest --- the amount of money you have at `year + 1` is the amount of money you had at `year` multiplied by the interest rate. That gives a formula like `money = starting * interest ** year`:\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a3ffc085",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"\n",
"starting = 100\n",
"interest = 1.05\n",
"money = pd.DataFrame(\n",
" {\"year\": 2000 + np.arange(1, 51), \"money\": starting * interest ** np.arange(1, 51)}\n",
")\n",
"money.head()"
]
},
{
"cell_type": "markdown",
"id": "4bf4afd0",
"metadata": {},
"source": [
"If you plot this data, you'll get an exponential curve:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "aaad2abb",
"metadata": {},
"outputs": [],
"source": [
"money.plot(x=\"year\", y=\"money\");"
]
},
{
"cell_type": "markdown",
"id": "f42a0f24",
"metadata": {},
"source": [
"Log transforming the y-axis gives a straight line:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bbcf4d7c",
"metadata": {},
"outputs": [],
"source": [
"money.plot(x=\"year\", y=\"money\", logy=True);"
]
},
{
"cell_type": "markdown",
"id": "5be9a267",
"metadata": {},
"source": [
"This a straight line because `log(money) = log(starting) + n * log(interest)` matches the pattern for a line, `y = m * x + b`. This is a useful pattern: if you see a (roughly) straight line after log-transforming the y-axis, you know that there's underlying exponential growth.\n",
"\n",
"If you're log-transforming your data you have a choice of a lot of logarithms provided by **numpy** but there are three ones you'll want to commonly use: assuming you've imported it using `import numpy as np`, you have `np.log()` (the natural log, base e), `np.log2()` (base 2), and `np.log10()` (base 10).\n",
"\n",
"The inverse of `log()` is `np.exp()`; to compute the inverse of `np.log2()` or `np.log10()` you'll need to use `2**` or `10**`."
]
},
{
"cell_type": "markdown",
"id": "da97da51",
"metadata": {},
"source": [
"### Rounding\n",
"\n",
"To round to a specific number of decimal places, use `.round(n)` where `n` is the number of decimal places that you wish to round to."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ec1a09ca",
"metadata": {},
"outputs": [],
"source": [
"money.head().round(2)"
]
},
{
"cell_type": "markdown",
"id": "2b26e633",
"metadata": {},
"source": [
"This can also be applied to individual columns or differentially to columns via a dictionary:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9a306519",
"metadata": {},
"outputs": [],
"source": [
"money.head().round({\"year\": 0, \"money\": 1})"
]
},
{
"cell_type": "markdown",
"id": "d8fa46f8",
"metadata": {},
"source": [
"`.round(n)` rounds to the nearest `10**(-n)` so `n = 2` will round to the nearest 0.01. This definition is useful because it implies `.round(-2)` will round to the nearest hundred, which indeed it does:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fb9dfb59",
"metadata": {},
"outputs": [],
"source": [
"money.tail().round({\"year\": 0, \"money\": -2})"
]
},
{
"cell_type": "markdown",
"id": "6544b94e",
"metadata": {},
"source": [
"Sometimes, you'll want to round according to significant figures rather than decimal places. There's not a really easy way to do this but you can define a custom function to do it. Here's an example of rounding to 2 significant figures (change the `2` in the below to round to a different number of significant figures):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f0290c0a",
"metadata": {},
"outputs": [],
"source": [
"money[\"money\"].head().apply(lambda x: float(f'{float(f\"{x:.2g}\"):g}'))"
]
},
{
"cell_type": "markdown",
"id": "c4a1252d",
"metadata": {},
"source": [
"If you have an array or list of numbers outside of a data frame, you can use the **numpy** function"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6a6a4644",
"metadata": {},
"outputs": [],
"source": [
"np.round([1.5, 2.5, 1.4])"
]
},
{
"cell_type": "markdown",
"id": "9f87876c",
"metadata": {},
"source": [
"**numpy** has both `.floor()` and `.ceil()` methods too"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6270c3cc",
"metadata": {},
"outputs": [],
"source": [
"real_nums = 100 * np.random.random(size=10)\n",
"real_nums"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dc3608d6",
"metadata": {},
"outputs": [],
"source": [
"np.ceil(real_nums)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d577bb21",
"metadata": {},
"outputs": [],
"source": [
"np.floor(real_nums)"
]
},
{
"cell_type": "markdown",
"id": "c22b0eda",
"metadata": {},
"source": [
"Remember that you can always apply **numpy** functions to **pandas** data frame columns like so:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dca417b7",
"metadata": {},
"outputs": [],
"source": [
"money[\"money\"].head().apply(np.ceil)"
]
},
{
"cell_type": "markdown",
"id": "d3c1710e",
"metadata": {},
"source": [
"### Cumulative and Rolling Aggregates\n",
"\n",
"**pandas** has several cumulative functions, including `.cumsum()`, `.cummax()` and `.cummin()`, and `.cumprod()`. "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c5d77818",
"metadata": {},
"outputs": [],
"source": [
"money[\"money\"].tail().cumsum()"
]
},
{
"cell_type": "markdown",
"id": "c4b569cb",
"metadata": {},
"source": [
"As ever, this can be applied across rows too by passing `axis=1`."
]
},
{
"cell_type": "markdown",
"id": "0b4355c7",
"metadata": {},
"source": [
"## General transformations\n",
"\n",
"The following sections describe some general transformations that are often used with numeric vectors, but which can be applied to all other column types.\n",
"\n",
"### Ranking\n",
"\n",
"**pandas**' rank function is `.rank()`. Let's look back at the data we made earlier and rank it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "db1272a3",
"metadata": {},
"outputs": [],
"source": [
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "11ee0bac",
"metadata": {},
"outputs": [],
"source": [
"df.rank()"
]
},
{
"cell_type": "markdown",
"id": "85345760",
"metadata": {},
"source": [
"Of course, there's no change here because the items were already ranked! We can also do a pct rank by passing the keyword argument `pct=True`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "edc7bd81",
"metadata": {},
"outputs": [],
"source": [
"df.rank(pct=True)"
]
},
{
"cell_type": "markdown",
"id": "8914dffd",
"metadata": {},
"source": [
"### Offsets and Shifting\n",
"\n",
"Offsets allow you to 'roll' columns up and down relative to their original position, ie to offset their location relative to the index. The function that does this is called `shift()` and it produces leads or lags depending on whether you use positive or negative values respectively. Remember that a lead is going to shift the pattern in the data to the left when plotted against the index, while the lag is going to shift patterns to the right. Leads and lags are particularly useful for time series data (which we haven't yet seen).\n",
"\n",
"Let's see an example of offsets with the `money` data frame from earlier:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9e499d1b",
"metadata": {},
"outputs": [],
"source": [
"money[\"money_lag_5\"] = money[\"money\"].shift(5)\n",
"money[\"money_lead_10\"] = money[\"money\"].shift(-10)\n",
"money.set_index(\"year\").plot();"
]
},
{
"cell_type": "markdown",
"id": "773ad50a",
"metadata": {},
"source": [
"### Exercises\n",
"\n",
"1. Find the 10 most delayed flights using a ranking function.\n",
"\n",
"2. Which plane (`\"tailnum\"`) has the worst on-time record?\n",
"\n",
"3. What time of day should you fly if you want to avoid delays as much as possible?\n",
"\n",
"4. For each destination, compute the total minutes of delay. For each flight, compute the proportion of the total delay for its destination.\n",
"\n",
"5. Delays are typically temporally correlated: even once the problem that caused the initial delay has been resolved, later flights are delayed to allow earlier flights to leave. Using `.shift()`, explore how the average flight delay for an hour is related to the average delay for the previous hour.\n",
"\n",
"6. Look at each destination. Can you find flights that are suspiciously fast? (i.e. flights that represent a potential data entry error). Compute the air time of a flight relative to the shortest flight to that destination. Which flights were most delayed in the air?\n",
"\n",
"7. Find all destinations that are flown by at least two carriers. Use those destinations to come up with a relative ranking of the carriers based on their performance for the same destination.\n"
]
},
{
"cell_type": "markdown",
"id": "005a82f5",
"metadata": {},
"source": [
"## More Useful Summary Statistics\n",
"\n",
"We've seen how useful `.mean()`, `.count()`, and `.value_counts()` can be for analysis. **pandas** has a great many more built-in summary statistics functions, however. These include `.median()` (you may find it interesting to compare the mean vs the median when looking at the hourly departure delay in the flights data), `.mode()`, `.min()`, and `.max()`.\n",
"\n",
"A class of useful summary statistics are provided by the `.quantile` function, which is the same as `median` for `.quantile(0.5)`. The quantile at x% is the value that x% of values are below. (Note that under this definition, `.quantile(1)` will be the same as `.max()`.) Let's see an example with the 25th percentile."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "81389031",
"metadata": {},
"outputs": [],
"source": [
"money[\"money\"].quantile(0.25)"
]
},
{
"cell_type": "markdown",
"id": "02c212f4",
"metadata": {},
"source": [
"Sometimes you don't just want one percentile, but a bunch of them. **pandas** makes this very easy by allowing you to pass a list of quantiles:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9eb35866",
"metadata": {},
"outputs": [],
"source": [
"money[\"money\"].quantile([0, 0.25, 0.5, 0.75])"
]
},
{
"cell_type": "markdown",
"id": "224b6a3a",
"metadata": {},
"source": [
"### Spread\n",
"\n",
"Sometimes you're not so interested in where the bulk of the data lies, but how it is spread out.\n",
"Two commonly used summaries are the standard deviation, `.std()`, and the inter-quartile range, which you can compute from the relevant quantiles, ie it's the quantile at 75% minus the quantile at 25% and gives you the range that contains the middle 50% of the data.\n",
"\n",
"We can use this to reveal a small oddity in the flights data. You might expect that the spread of the distance between origin and destination to be zero, since airports are always in the same place. But the code below makes it looks like one airport, [EGE](https://en.wikipedia.org/wiki/Eagle_County_Regional_Airport), might have moved."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e2efca3a",
"metadata": {},
"outputs": [],
"source": [
"(\n",
" flights.groupby([\"origin\", \"dest\"])\n",
" .agg(\n",
" distance_sd=(\"distance\", lambda x: x.quantile(0.75) - x.quantile(0.25)),\n",
" count=(\"distance\", \"count\"),\n",
" )\n",
" .query(\"distance_sd > 0\")\n",
")"
]
},
{
"cell_type": "markdown",
"id": "ce5e572d",
"metadata": {},
"source": [
"### Distributions\n",
"\n",
"It's worth remembering that all of the summary statistics described above are a way of reducing the distribution down to a single number. This means that they're fundamentally reductive, and if you pick the wrong summary, you can easily miss important differences between groups. That's why it's always a good idea to visualise the distribution of values as well as using aggregate statistics.\n",
"\n",
"The plot below shows the overall distribution of departure delays. The distribution is so skewed that we have to zoom in to see the bulk of the data. This suggests that the mean is unlikely to be a good summary and we might prefer the median instead."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b5ac4861",
"metadata": {},
"outputs": [],
"source": [
"flights[\"dep_delay\"].plot.hist(bins=50, title=\" Distribution: length of delay\");"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d4e76051",
"metadata": {},
"outputs": [],
"source": [
"flights.query(\"dep_delay <= 120\")[\"dep_delay\"].plot.hist(\n",
" bins=50, title=\" Distribution: length of delay\"\n",
");"
]
},
{
"cell_type": "markdown",
"id": "225e6704",
"metadata": {},
"source": [
"Two histograms of `\"dep_delay\"`. On the former, it's very hard to see any pattern except that there's a very large spike around zero, the bars rapidly decay in height, and for most of the plot, you can't see any bars because they are too short to see. On the latter, where we've discarded delays of greater than two hours, we can see that the spike occurs slightly below zero (i.e. most flights leave a couple of minutes early), but there's still a very steep decay after that.\n",
"\n",
"Don't be afraid to explore your own custom summaries specifically tailored for the data that you're working with. In this case, that might mean separately summarising the flights that left early vs the flights that left late, or given that the values are so heavily skewed, you might try a log-transformation. Finally, don't forget that it's always a good idea to include the number of observations in each group when creating summaries."
]
},
{
"cell_type": "markdown",
"id": "9f2f3523",
"metadata": {},
"source": [
"### Exercises\n",
"\n",
"1. Brainstorm at least 5 different ways to assess the typical delay characteristics of a group of flights.\n",
" Consider the following scenarios:\n",
"\n",
" - A flight is 15 minutes early 50% of the time, and 15 minutes late 50% of the time.\n",
" - A flight is always 10 minutes late.\n",
" - A flight is 30 minutes early 50% of the time, and 30 minutes late 50% of the time.\n",
" - 99% of the time a flight is on time. 1% of the time it's 2 hours late.\n",
"\n",
" Which do you think is more important: arrival delay or departure delay?\n",
"\n",
"2. Which destinations show the greatest variation in air speed?\n",
"\n",
"3. Create a plot to further explore the adventures of EGE.\n",
" Can you find any evidence that the airport moved locations?"
]
}
],
"metadata": {
"interpreter": {
"hash": "9d7534ecd9fbc7d385378f8400cf4d6cb9c6175408a574f1c99c5269f08771cc"
},
"jupytext": {
"cell_metadata_filter": "-all",
"encoding": "# -*- coding: utf-8 -*-",
"formats": "md:myst",
"main_language": "python"
},
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
},
"toc-showtags": true
},
"nbformat": 4,
"nbformat_minor": 5
}