pandas alternatives, validation, and orchestration#

Introduction#

In this chapter, you’ll learn about data validation, data orchestration, and some alternatives to the pandas package.

Imports#

import matplotlib.pyplot as plt
import pandas as pd

Alternatives to pandas#

pandas isn’t the only game in town, not by a long way—though it’s by far the best supported and the most fully featured. pandas is a good default package to use for data in the vast majority of cases. But it’s always good to have options—or, put another way, options have value! Other dataframe libraries may have a syntax that you prefer (datatable’s syntax is popular with some) or provide a speed-up in certain situations, for example when working with large datasets (polars promises to breeze through big data far faster than pandas can).

If you’re specifically interested in how different dataframe options perform on increasingly large datasets, take a look at the benchmarks by the folks who make DuckDB here.

If you only choose one pandas alternative, we recommend polars. It brings the benefits of speed and a consistent API.

Here’s a quick run-through of some alternatives to pandas:

Polars#

Polars is one of the fastest dataframe libraries in any language. It uses Apache Arrow as its backend. It currently consists of an ‘eager’ interface (for datasets smaller than approximately a few GB) that’s very similar to pandas, and a ‘lazy’ interface (don’t worry if you don’t know what that means, it’s a big data thing) that is somewhat similar to Spark (a big data tool). Polars is built on the Rust language. It’s particularly effective at merging datasets.
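To give a flavour, here’s a minimal sketch of the eager and lazy interfaces. The data are made up for illustration and the exact method names depend on the version of polars you have installed, so treat this as indicative:

import polars as pl

df_pl = pl.DataFrame({"x": [0, 1, 2], "y": [3, 4, 5]})

# eager: each operation runs immediately, much like pandas
eager = df_pl.filter(pl.col("x") > 0).with_columns((pl.col("x") * 2).alias("double_x"))

# lazy: operations build a query plan that polars optimises and only executes
# when .collect() is called
lazy = (
    df_pl.lazy()
    .filter(pl.col("x") > 0)
    .with_columns((pl.col("x") + pl.col("y")).alias("x_plus_y"))
    .collect()
)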

tidypolars#

tidypolars combines the syntax of dplyr, from the R statistical language, with the best-in-class speed (in any language, at the time of writing) of the Python package polars. Here’s an example of a typical command to give you a flavour of what the syntax looks like:

import tidypolars as tp
from tidypolars import col, desc

df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])

(
    df
    .select('x', 'y', 'z')
    .filter(col('x') < 4, col('y') > 1)
    .arrange(desc('z'), 'x')
    .mutate(double_x = col('x') * 2,
            x_plus_y = col('x') + col('y'))
)
┌─────┬─────┬─────┬──────────┬──────────┐
│ x   ┆ y   ┆ z   ┆ double_x ┆ x_plus_y │
│ --- ┆ --- ┆ --- ┆ ---      ┆ ---      │
│ i64 ┆ i64 ┆ str ┆ i64      ┆ i64      │
╞═════╪═════╪═════╪══════════╪══════════╡
│ 2   ┆ 5   ┆ b   ┆ 4        ┆ 7        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0   ┆ 3   ┆ a   ┆ 0        ┆ 3        │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1   ┆ 4   ┆ a   ┆ 2        ┆ 5        │
└─────┴─────┴─────┴──────────┴──────────┘

Datar#

Another library that gets close to dplyr syntax is datar. It integrates with plotnine for visualisation.
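To give a flavour of the syntax, here’s a rough sketch of a datar command. The import paths and function names have varied between datar versions, so treat this as indicative rather than definitive:

from datar import f
from datar.dplyr import mutate, select
from datar.tibble import tibble

df_datar = tibble(x=range(3), y=range(3, 6))

# pipe the data through dplyr-style verbs using the >> operator
df_datar >> select(f.x, f.y) >> mutate(z=f.x + f.y)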

Data Validation#

When working with data, it’s useful to have a way to validate that what you’re getting is what you think you’re getting. And so various data validation methods have appeared that will fail loudly and proudly when your data are not as expected. Although most useful in a situation where data are being regularly updated, for example when producing a monthly index, these tools can also be useful if you want to be extra careful with your data in a research paper, if you are scaling up from a sample to a bigger dataset, or if you are updating an existing project.

Of course, the most minimal way to do some validation is to ensure that your dataframe has its columns cast as the right data types from the start, and this is good practice in general. That takes care of data types, but these tools go beyond that and deal with the values within a data type. So rather than just saying a column must be an integer, a data validation library might say that a column must be an integer between 0 and 100.
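For example, a minimal sketch of being explicit about types in pandas (the column names here are made up) is to cast them as soon as you create or read in the data:

df_types = pd.DataFrame({"score": ["1", "2", "3"], "name": ["a", "b", "c"]})
# cast each column to the type you expect; this fails loudly if a value can't be converted
df_types = df_types.astype({"score": "int64", "name": "string"})
df_types.dtypes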

The normal way that data validation libraries are used is that you specify some expectations you have of the dataset that will be flagged if and when they fail. Let’s see some of the tools that are available that do this. As ever, this chapter has benefitted enormously from the documentation of the featured packages.

Pandera#

Pandera [Niels Bantilan, 2020] has tight integration with pandas (as the name suggests). There are several different ways to specify data validation checks.

Dataframe style checks#

It’s easiest to show how it works with an example. We’ll first create some data that we’d like to validate.

import pandera as pa

# data to validate
df = pd.DataFrame(
    {
        "column1": [1, 4, 0, 10, 9],
        "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
        "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
    }
)

df
   column1  column2  column3
0        1     -1.3  value_1
1        4     -1.4  value_2
2        0     -2.9  value_3
3       10    -10.1  value_2
4        9    -20.4  value_1

pandera (I’m going to spell it as pantera sooner or later) works by defining data schemas for each column that say what values should appear in that column. Let’s create a schema for this dataframe:

# define schema
schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int, checks=pa.Check.le(10)),
        "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
        "column3": pa.Column(
            str,
            checks=[
                pa.Check.str_startswith("value_"),
                # define custom checks as functions that take a series as input and
                # outputs a boolean or boolean Series
                pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2),
            ],
        ),
    }
)

What did we ask for here? That:

  • column 1 should be an integer that is less than or equal to 10

  • column 2 should be a continuous variable that’s less than -1.2

  • column 3 should be a string that begins ‘value_’ and has a custom check that each value splits into exactly two chunks around a single underscore.

Alright, let’s see what happens when we put well-behaved data through the schema.

validated_df = schema(df)
print(validated_df)
   column1  column2  column3
0        1     -1.3  value_1
1        4     -1.4  value_2
2        0     -2.9  value_3
3       10    -10.1  value_2
4        9    -20.4  value_1

The data come through the schema without incident because they satisfy all of the checks. Now, what about some data that aren’t well-behaved, in just a single column (column1)?

# data to validate
bad_df = pd.DataFrame({
    "column1": [11, 4],
    "column2": [-1.3, -1.4],
    "column3": ["value_1", "value_2"],
})
schema(bad_df)
---------------------------------------------------------------------------
SchemaError                               Traceback (most recent call last)
Cell In[6], line 6
      1 bad_df = pd.DataFrame({
      2     "column1": [11, 4],
      3     "column2": [-1.3, -1.4],
      4     "column3": ["value_1", "value_2"],
      5 })
----> 6 schema(bad_df)
...
...
SchemaError:  failed element-wise validator 0:

failure cases:
   index  failure_case
0      0            11

The error message is rather verbose, but scroll to the end and we see the cause: a SchemaError. Not only that, but we get a report about what failed (column1), which check it was (check 0), and even what value triggered the error (11).

Use schema.validate(df, lazy=True) to get all of the errors rather than just the first one to be triggered.
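Here’s a sketch of how that looks using the bad_df from above; pandera raises a SchemaErrors exception that collects every failure:

try:
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as err:
    # a dataframe listing every check that failed, not just the first
    print(err.failure_cases)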

You can also pass hypothesis tests into a schema as a check; there’s a nice example of using a two-sample t-test in the documentation.
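For a flavour, here’s a sketch adapted from that part of the documentation. It needs scipy installed, and the column and group names below are purely illustrative:

hypothesis_schema = pa.DataFrameSchema(
    {
        "height": pa.Column(
            float,
            # check that group B's mean height is greater than group A's
            pa.Hypothesis.two_sample_ttest(
                sample1="B",
                sample2="A",
                groupby="group",
                relationship="greater_than",
                alpha=0.05,
            ),
        ),
        "group": pa.Column(str, pa.Check.isin(["A", "B"])),
    }
)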

Data checks as a class#

The ‘dataframe’ style of checks is not the only way to specify a schema. If you’re au fait with classes, you can also specify checks that way. Here’s the same example in that form:

from pandera.typing import Series


class Schema(pa.SchemaModel):

    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that column3 values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2


Schema.validate(df)
   column1  column2  column3
0        1     -1.3  value_1
1        4     -1.4  value_2
2        0     -2.9  value_3
3       10    -10.1  value_2
4        9    -20.4  value_1

Again, this passes, but it would not have done with bad_df as defined earlier.

It doesn’t have to be a single column that you’re checking at a time; pandera can also check data that are the subject of groupby operations. Here’s an example where data are grouped by a group column and the means of the height column are then compared:

df = pd.DataFrame(
    {
        "height": [5.6, 6.4, 4.0, 7.1],
        "group": ["A", "B", "A", "B"],
    }
)

schema = pa.DataFrameSchema(
    {
        "height": pa.Column(
            pa.Float, pa.Check(lambda g: g["A"].mean() < g["B"].mean(), groupby="group")
        ),
        "group": pa.Column(pa.String),
    }
)

schema.validate(df)
   height group
0     5.6     A
1     6.4     B
2     4.0     A
3     7.1     B

Data checks for pipelines#

You can also specify the validations as function decorators that check the data as it enters (check_input) or exits (check_output) a function (or both, with check_io). This could be really useful as part of a larger pipeline of data processes. Here’s an example (of a dataframe that passes the input checks):

df = pd.DataFrame(
    {
        "column1": [1, 4, 0, 10, 9],
        "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    }
)

in_schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(
            pa.Int, pa.Check(lambda x: 0 <= x <= 10, element_wise=True)
        ),
        "column2": pa.Column(pa.Float, pa.Check(lambda x: x < -1.2)),
    }
)


@pa.check_input(in_schema)
def preprocessor(dataframe):
    dataframe["column3"] = dataframe["column1"] + dataframe["column2"]
    return dataframe


preprocessed_df = preprocessor(df)
preprocessed_df
   column1  column2  column3
0        1     -1.3     -0.3
1        4     -1.4      2.6
2        0     -2.9     -2.9
3       10    -10.1     -0.1
4        9    -20.4    -11.4
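You can validate outputs in the same way. Here’s a sketch using check_io to check both input and output; out_schema below is a hypothetical extension of in_schema with the new column added:

out_schema = in_schema.add_columns({"column3": pa.Column(pa.Float)})


@pa.check_io(dataframe=in_schema, out=out_schema)
def transform(dataframe):
    # validated on the way in (in_schema) and on the way out (out_schema)
    dataframe["column3"] = dataframe["column1"] + dataframe["column2"]
    return dataframe


transform(df)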

Great Expectations#

Great Expectations is another data validation library. It’s really geared toward production (and integrates well with some other production tools). I wouldn’t necessarily recommend it if you’re doing a research project, but it has some amazing features that mean it just made it into this chapter, namely that it doesn’t just validate data: it documents and profiles it too.

Their philosophy is that ‘Expectations’, the checks or assertions about data, are unit tests for your data (don’t worry if you’re not familiar with unit tests; they are checks for your code!). Once you have run your expectations, the library can create data documentation and data quality reports from them. The data documentation is quite an amazing feature, providing a navigable HTML report of your dataset.

Using Great Expectations is a bit different from pandera as it replaces your dataframe with a Great Expectations PandasDataset that looks and feels just like a regular pandas dataframe but has extra methods related to data validation. To convert a regular pandas dataframe, you wrap it with ge.from_pandas().

Let’s look at some examples:

import great_expectations as ge

df = ge.from_pandas(
    pd.DataFrame(
        {
            "column1": [1, 4, 0, 10, 9],
            "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
            "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
        }
    )
)
df.head()
   column1  column2  column3
0        1     -1.3  value_1
1        4     -1.4  value_2
2        0     -2.9  value_3
3       10    -10.1  value_2
4        9    -20.4  value_1

Now let’s run some expectations:

df.expect_column_values_to_be_of_type("column1", "int")
{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
df.expect_column_proportion_of_unique_values_to_be_between("column3", 3, 5)
{
  "success": false,
  "result": {
    "observed_value": 0.6,
    "element_count": 5,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Note that the first expectation passed, while the second failed but did not throw an error: we were just told about it in the returned JSON.

Let’s also see a less stringent expectation, one that only requires 80% of values to meet it:

df.expect_column_values_to_be_between(
    column="column2",
    min_value=-15,
    max_value=0,
    mostly=0.80,
)
{
  "success": true,
  "result": {
    "element_count": 5,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 1,
    "unexpected_percent": 20.0,
    "unexpected_percent_total": 20.0,
    "unexpected_percent_nonmissing": 20.0,
    "partial_unexpected_list": [
      -20.4
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Having a bunch of JSON thrown at you is hardly a pleasant experience, and running a few commands solely in a notebook (which is how this was generated) is not really the way that Great Expectations is intended to be used. Instead, it is designed to be used via several commands in the terminal that set up folders and files that will track the expectations you create and produce nice HTML reports on your data and whether it passed your tests or not. There is a full tutorial available on the website.

Pydantic#

Pydantic has some of the same features as pandera but it piggybacks on the ability of Python 3.5+ to have ‘typed’ variables (if you’re not sure what that means, it’s a way to declare that a variable has a particular data type from inception) and it is really focused on the validation of objects (created from classes) rather than dataframes.

If you’ve used the SQLModel package for writing SQL queries, you may be interested to know that every SQLModel class is also a Pydantic model.
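As a brief aside, here’s a minimal sketch of what that looks like; the Hero example is adapted from the SQLModel documentation:

from typing import Optional

from sqlmodel import Field, SQLModel


# a SQLModel table class: simultaneously a Pydantic model and a SQL table definition
class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    secret_name: str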

Here’s an example of a Pydantic schema that also implements a class:

from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel


class User(BaseModel):
    id: int
    name = "Katherine Johnson"
    signup_ts: Optional[datetime] = None
    friends: List[int] = []

We now pass some (acceptable) external data into this schema, creating an object of type User:

external_data = {
    "id": "123",
    "signup_ts": "2019-06-01 12:22",
    "friends": [1, 2, "3"],
}
user = User(**external_data)
user
User(id=123, signup_ts=datetime.datetime(2019, 6, 1, 12, 22), friends=[1, 2, 3], name='Katherine Johnson')

That all worked well (as expected), but what happens if we now create an object with ‘bad’ data?

User(signup_ts='broken', friends=[1, 2, 'not number'])
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[16], line 1
----> 1 User(signup_ts='broken', friends=[1, 2, 'not number'])


ValidationError: 3 validation errors for User
id
  field required (type=value_error.missing)
signup_ts
  invalid datetime format (type=value_error.datetime)
friends -> 2
  value is not a valid integer (type=type_error.integer)

We get a loud failure, a ValidationError, and a short report about what went wrong.

Cerberus#

Our final stop on this tour of data validation is Cerberus. It is designed to provide simple and lightweight data validation functionality. In this case, the schema is specified as a (JSON-like) Python dictionary:

schema = {"name": {"type": "string"}, "score": {"type": "integer", "max": 10}}

Then we pass data into a Cerberus validator object along with the schema. This will return True for good data, but for bad data…

from cerberus import Validator

v = Validator()
data = {"name": "Sadie Alexander", "score": 12, "occupation": "Economist"}
v.validate(data, schema)
False

This is useful to know but we might want a bit more detail:

v.errors
{'occupation': ['unknown field'], 'score': ['max value is 10']}

Much more complex schemas can be constructed.
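For instance, here’s a sketch of a slightly richer schema with nesting and extra rules; the field names are made up for illustration:

richer_schema = {
    "name": {"type": "string", "minlength": 2, "required": True},
    # a list of integer scores, each between 0 and 10
    "scores": {
        "type": "list",
        "schema": {"type": "integer", "min": 0, "max": 10},
    },
    # a nested dictionary with its own schema
    "contact": {
        "type": "dict",
        "schema": {"email": {"type": "string", "regex": r"[^@]+@[^@]+"}},
    },
}

v = Validator(richer_schema)
v.validate(
    {
        "name": "Sadie Alexander",
        "scores": [3, 9],
        "contact": {"email": "sadie@example.com"},
    }
)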

Fake and Synthetic Data#

Fake data can be very useful for testing pipelines and trying out analyses before you have the real data. The closer the fake data are to the real data, the better prepared you will be for working with the real thing.

Fake data are data generated according to a schema that bear no statistical relationship to the real data. There are powerful tools for generating fake data; one of the most popular and fastest libraries for doing so is Mimesis.
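As a quick sketch of what Mimesis can do (the exact API has changed between versions, for example in how locales are passed, so treat this as indicative):

from mimesis import Person
from mimesis.locales import Locale

person = Person(Locale.EN)

# generate a few fake, but realistic-looking, records
fake_people = [
    {"name": person.full_name(), "email": person.email(), "age": person.age()}
    for _ in range(3)
]
fake_people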

Synthetic data take this one step further: they are generated from the underlying data and capture some of its statistical properties. Again, they can be useful for trying out data before using the real thing, especially if the real data are highly confidential. A useful package for comparing real and synthetic data is SynthGauge.

Data Orchestration#

Data orchestration is the automation of data-driven processes from end-to-end, including preparing data, making decisions based on that data, and taking actions based on those decisions. Think of a data pipeline in which you extract data (perhaps from files), transform it somehow, and load it into where you want to put it (downstream in your research process, or perhaps into an app). Sometimes these processes are called ETL, for extract, transform, and load. Some of their benefits include reproducibility, and being able to run tasks on a schedule—whether you are at your computer or not, and (if you’re using them in the cloud), whether your computer is even on or not! They are dreamboats for automation.

A lot of them visualise the tasks they have to carry out as a DAG, or Directed Acyclic Graph. This defines how different tasks depend on each other in order, with one task following from the previous one (or perhaps following from multiple previous tasks). When they carry out a DAG of tasks on an event or at a scheduled time, they create a log of the execution of the steps too—so if something goes wrong, you can always look back at it. However, they are a layer of abstraction between you and the data, and this does make them a more advanced tool.

Going into further details of data orchestration is outside of the scope of this book, so we won’t be seeing a practical example, but we think it’s important enough to mention it and to point you to more resources.

There are some truly amazing tools out there to help you do this on a production scale. Broadly, there is a 1st generation of tools that are excellent but might have more trouble with tasks such as:

  • local development, testing, and storage abstractions

  • one-off and irregularly scheduled tasks

  • the movement of data between tasks

  • dynamic and parameterised workflows

And there is a 2nd generation of tools that tries to solve some of these issues.

Of the 1st generation, by far the best known and most suited to production is Apache’s (originally AirBnB’s) Airflow. There’s a less mature 1st generation tool from Spotify called Luigi too, but Airflow is widely used in the tech industry, and it doesn’t just schedule data processes in Python: it can run processes in pretty much whatever you like. Both of these tools try to solve the ‘plumbing’ associated with long-running batch processes on data: chaining tasks, automating them, dealing with failures, and scheduling. Both Luigi and Airflow have fancy interfaces to show you what’s going on with your tasks.

Tools in the 2nd generation include the easy-to-use Prefect, “modern workflow orchestration for data and ML engineers”, and dagster, which bills itself as “a data orchestrator for machine learning, analytics, and ETL [extract, transform, and load]”. These have visualisation interfaces too, can run locally or on the cloud, and are more flexible about triggers for events.

For a research project where you only need to execute your DAG once, and you’re probably much more concerned about the costliness of each step than anything else, these tools are likely to be overkill or not quite what you need. But if you’re automating an ETL process, they’re perfect (prefect). The best place to get started is by looking at their webpages and googling around for tutorials.