Advanced Data#
Note
If you’re just starting to code, you can safely skip this chapter.
Introduction#
In this chapter, you'll learn about advanced methods and tools for working with data, including data validation, data orchestration, and some alternatives to the pandas package.
Imports#
First we need to import the classic packages we almost always need:
import os
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Alternatives to pandas#
pandas isn’t the only game in town, not by a long way—though it’s by far the best supported and the most fully featured. pandas is a good default package to use for data in the vast majority of cases. But it’s always good to have options—or, put another way, options have value! Other dataframe libraries may have a syntax that you prefer (datatable’s syntax is popular with some) or provide a speed-up in certain situations, for example when working with large datasets (polars promises to breeze through big data far faster than pandas can).
If you’re specifically interested in how different dataframe options perform on increasingly large datasets, take a look at the benchmarks here.
Here’s a quick run-through of some alternatives to pandas:
Datatable#
datatable is another dataframe-based way to do analysis, and it has quite a different syntax to pandas for data manipulation. datatable is very fast; it isn't the fastest dataframe option out there, but it holds its own. Its other major advantage is that it is comfortable running on extremely large (think 50GB) datasets. Most other dataframe packages (including in other languages) cannot cope with this.
In datatable, almost all operations are achieved via so-called 'square-bracket notation': operations on a (data) Frame are expressed in the form
DT[i, j, ...]
where `DT` is a datatable dataframe, `i` is the row selector, `j` is the column selector, and `...` is a space that could be filled by other commands (potentially several). However, `i` and `j` aren't just positional, as they might be in a numpy array or a list; they do a lot more than that. Datatable allows `i` to be anything that can conceivably be interpreted as a row selector: an integer to select a row, a slice, a range, a list of integers, a list of slices, an expression, a boolean-valued Frame, an integer-valued Frame, an integer numpy array, a generator, and more.

Likewise, `j` is also really versatile. It can be used to select a column by name or position, but it will also accept a list of columns, a slice, a list of booleans indicating which columns to pick, an expression, a list of expressions, and a dictionary of expressions. The `j` expression can even be a python type (such as `int` or `dt.float32`), selecting all columns matching that type!

To change an entry (or entries), you can use `DT[i, j] = new_value`, and to remove it (or them), it's `del DT[i, j]`.
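To give a flavour of this notation, here's a minimal sketch using an invented frame (the column names and values are made up purely for illustration):

```python
import datatable as dt
from datatable import f  # f refers to columns of the frame being operated on

# an invented example frame
DT = dt.Frame(sales=[10, 25, 3, 41], region=["north", "south", "north", "east"])

DT[0, :]                                 # i as an integer: first row, all columns
DT[f.sales > 5, "region"]                # i as an expression, j as a column name
DT[:, dt.sum(f.sales), dt.by(f.region)]  # the '...' slot used for a grouped sum
```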
cuDF#
cuDF is still under development and so doesn't yet have all the features of pandas. It's a Python GPU DataFrame library built on the blisteringly fast Apache Arrow columnar memory format. The parts of the library that have been implemented follow the same commands and structure as pandas, so it should be easy to pick up should you need it.
cuDF is one of the fastest dataframe libraries out there in any of Python, Julia, Rust, or R, with the caveats that:
this only applies to in-memory datasets, i.e. datasets smaller than a few GB
it doesn’t yet do everything
it’s only currently available on the Linux operating system
you need a GPU (and associated software) to use it!
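As a quick, hedged sketch of how close it feels to pandas (this assumes you have a CUDA-capable GPU with cuDF installed, and the data are invented):

```python
import cudf

# build a GPU dataframe directly, or convert an existing pandas one
gdf = cudf.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]})
# gdf = cudf.from_pandas(df)  # the conversion route from pandas

# the same idioms as pandas, executed on the GPU
gdf[gdf["a"] > 2].groupby("a").mean()
```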
Polars#
Polars is one of the fastest dataframe libraries in any language. It also uses Apache Arrow as its backend. It currently consists of an 'eager' interface (for datasets smaller than approximately a few GB) that's very similar to pandas and a 'lazy' interface (don't worry if you don't know what that means, it's a big data thing) that is somewhat similar to spark (a big data tool). Polars is built on the Rust language. It's particularly effective at merging datasets.
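Here's a brief sketch of what working with polars looks like (the data are invented, and the exact API may differ a little between polars versions):

```python
import polars as pl

df_pl = pl.DataFrame({"key": ["a", "b", "c"], "value": [1, 2, 3]})
other = pl.DataFrame({"key": ["a", "b"], "label": ["low", "high"]})

# eager, pandas-like operations, including a fast join (merge)
df_pl.filter(pl.col("value") > 1).join(other, on="key")

# the lazy interface builds a query plan that only executes when you .collect()
df_pl.lazy().filter(pl.col("value") > 1).collect()
```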
Plydata#
Plydata uses the syntax of the dplyr package in the R statistical language (the package is a really important part of the data ecosystem in R). It’s built around method chaining and everything being in a tidy format, which has pros and cons. Largely, the benefits are in readability and a smaller number of commands to cover the ground that you need to. The downsides are that it plays best with tidy data and when chained methods go wrong it can be more difficult to find out what the issue is.
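As a rough sketch of the flavour of the syntax (this assumes plydata's `define` and `query` verbs, which mirror dplyr's `mutate` and `filter`; the data are invented):

```python
from plydata import define, query

tidy_df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

# dplyr-style verbs chained together with the >> operator
tidy_df >> define(z="x + y") >> query("z > 5")
```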
Datar#
Another library that gets close to dplyr syntax is datar. Like plydata, it integrates with plotnine for visualisation.
tidypolars#
tidypolars combines the syntax of dplyr with the best-in-class speed (for any language, as of the time of writing) of the Python package polars. Here's an example of a typical command to give you a flavour of what the syntax looks like:
import tidypolars as tp
from tidypolars import col, desc
df = tp.Tibble(x = range(3), y = range(3, 6), z = ['a', 'a', 'b'])
(
    df
    .select('x', 'y', 'z')
    .filter(col('x') < 4, col('y') > 1)
    .arrange(desc('z'), 'x')
    .mutate(double_x = col('x') * 2,
            x_plus_y = col('x') + col('y'))
)
┌─────┬─────┬─────┬──────────┬──────────┐
│ x ┆ y ┆ z ┆ double_x ┆ x_plus_y │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪══════════╪══════════╡
│ 2 ┆ 5 ┆ b ┆ 4 ┆ 7 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 3 ┆ a ┆ 0 ┆ 3 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 4 ┆ a ┆ 2 ┆ 5 │
└─────┴─────┴─────┴──────────┴──────────┘
Others#
Data Validation#
When working with data, it’s useful to have a way to validate that what you’re getting is what you think you’re getting. And so various data validation methods have appeared that will fail loudly and proudly when your data are not as expected. Although most useful in a situation where data are being regularly updated, for example when producing a monthly index, these tools can also be useful if you want to be extra careful with your data in a research paper, if you are scaling up from a sample to a bigger dataset, or if you are updating an existing project.
Of course, the most minimal way to do some validation is to ensure that your dataframe has its columns cast as the right data types from the start, and this is good practice in general. That’s about data type, but these tools go beyond that and deal with values within a data type. So rather than just saying it’s an integer, a data validation library might say that a column must be an integer between 0 and 100.
The normal way that data validation libraries are used is that you specify some expectations you have of the dataset that will be flagged if and when they fail. Let’s see some of the tools that are available that do this. As ever, this chapter has benefitted enormously from the documentation of the featured packages.
Pandera#
Pandera [Niels Bantilan, 2020] has tight integration with pandas (as the name suggests). There are several different ways to specify data validation checks.
Dataframe style checks#
It’s easiest to show how it works with an example. We’ll first create some data that we’d like to validate.
import pandera as pa
# data to validate
df = pd.DataFrame(
    {
        "column1": [1, 4, 0, 10, 9],
        "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
        "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
    }
)
df
|   | column1 | column2 | column3 |
|---|---------|---------|---------|
| 0 | 1       | -1.3    | value_1 |
| 1 | 4       | -1.4    | value_2 |
| 2 | 0       | -2.9    | value_3 |
| 3 | 10      | -10.1   | value_2 |
| 4 | 9       | -20.4   | value_1 |
pandera (I’m going to spell it as pantera sooner or later) works by defining data schemas for each column that say what values should appear in that column. Let’s create a schema for this dataframe:
# define schema
schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(int, checks=pa.Check.le(10)),
        "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
        "column3": pa.Column(
            str,
            checks=[
                pa.Check.str_startswith("value_"),
                # define custom checks as functions that take a series as input and
                # output a boolean or boolean Series
                pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2),
            ],
        ),
    }
)
What did we ask for here? That:
column 1 should be an integer that is less than or equal to 10
column 2 should be a continuous variable that's less than -1.2
column 3 should be a string that begins with 'value_' and has a custom check on it that it can be split into two chunks based on the single appearance of an underscore
Alright, let’s see what happens when we put well-behaved data through the schema.
validated_df = schema(df)
print(validated_df)
column1 column2 column3
0 1 -1.3 value_1
1 4 -1.4 value_2
2 0 -2.9 value_3
3 10 -10.1 value_2
4 9 -20.4 value_1
The data come through the schema without incident because they satisfy all the checks. Now, what about some data that aren't well behaved in a single column (here, column 1)?
# data to validate
bad_df = pd.DataFrame({
    "column1": [11, 4],
    "column2": [-1.3, -1.4],
    "column3": ["value_1", "value_2"],
})
schema(bad_df)
SchemaError Traceback (most recent call last)
<ipython-input-6-ff71778da6fd> in <module>
5 "column3": ["value_1", "value_2"],
6 })
----> 7 schema(bad_df)
...
...
SchemaError: <Schema Column: 'column1' type=<class 'int'>> failed element-wise validator 0:
<Check less_than_or_equal_to: less_than_or_equal_to(10)>
failure cases:
index failure_case
0 0 11
The error message is rather verbose, but scroll to the end and we see the cause of it: a `SchemaError`. Not only that, but we get a report about what failed (column 1), which check it was (0), and even what value triggered the error (11).

Use `schema.validate(df, lazy=True)` to get all of the errors rather than just the first one to be triggered.
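As a minimal sketch of what that looks like in practice, you can catch pandera's `SchemaErrors` exception, which collects every failed check:

```python
try:
    schema.validate(bad_df, lazy=True)
except pa.errors.SchemaErrors as err:
    # a dataframe listing every check that failed, not just the first
    print(err.failure_cases)
```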
You can also pass hypothesis tests into a schema as a check; there’s a nice example of using a two-sample t-test in the documentation.
Data checks as a class#
The ‘dataframe’ style of checks is not the only way to specify a schema. If you’re au fait with classes, you can also specify checks that way. Here’s the same example in that form:
from pandera.typing import Series
class Schema(pa.SchemaModel):

    column1: Series[int] = pa.Field(le=10)
    column2: Series[float] = pa.Field(lt=-1.2)
    column3: Series[str] = pa.Field(str_startswith="value_")

    @pa.check("column3")
    def column_3_check(cls, series: Series[str]) -> Series[bool]:
        """Check that column3 values have two elements after being split with '_'"""
        return series.str.split("_", expand=True).shape[1] == 2
Schema.validate(df)
|   | column1 | column2 | column3 |
|---|---------|---------|---------|
| 0 | 1       | -1.3    | value_1 |
| 1 | 4       | -1.4    | value_2 |
| 2 | 0       | -2.9    | value_3 |
| 3 | 10      | -10.1   | value_2 |
| 4 | 9       | -20.4   | value_1 |
Again, this passes, but it would not have done so with `bad_df` as defined earlier.

It doesn't have to be a single column that you're checking at a time; pandera can also check data that are the subject of `groupby` operations. Here's an example where data are grouped by a `group` column and then the means of the `height` column are compared:
df = pd.DataFrame(
    {
        "height": [5.6, 6.4, 4.0, 7.1],
        "group": ["A", "B", "A", "B"],
    }
)

schema = pa.DataFrameSchema(
    {
        "height": pa.Column(
            pa.Float, pa.Check(lambda g: g["A"].mean() < g["B"].mean(), groupby="group")
        ),
        "group": pa.Column(pa.String),
    }
)
schema.validate(df)
|   | height | group |
|---|--------|-------|
| 0 | 5.6    | A     |
| 1 | 6.4    | B     |
| 2 | 4.0    | A     |
| 3 | 7.1    | B     |
Data checks for pipelines#
You can also specify the validations as function decorators that check the data as it enters (`check_input`) or exits (`check_output`) a function (or both, with `check_io`). This could be really useful as part of a larger pipeline of data processes. Here's an example (of a dataframe that passes the input checks):
df = pd.DataFrame(
    {
        "column1": [1, 4, 0, 10, 9],
        "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
    }
)

in_schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(
            pa.Int, pa.Check(lambda x: 0 <= x <= 10, element_wise=True)
        ),
        "column2": pa.Column(pa.Float, pa.Check(lambda x: x < -1.2)),
    }
)


@pa.check_input(in_schema)
def preprocessor(dataframe):
    dataframe["column3"] = dataframe["column1"] + dataframe["column2"]
    return dataframe
preprocessed_df = preprocessor(df)
preprocessed_df
|   | column1 | column2 | column3 |
|---|---------|---------|---------|
| 0 | 1       | -1.3    | -0.3    |
| 1 | 4       | -1.4    | 2.6     |
| 2 | 0       | -2.9    | -2.9    |
| 3 | 10      | -10.1   | -0.1    |
| 4 | 9       | -20.4   | -11.4   |
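For completeness, here's a sketch of the corresponding output check; the `out_schema` and `postprocessor` names are our own, invented for illustration:

```python
out_schema = pa.DataFrameSchema(
    {
        "column1": pa.Column(pa.Int),
        "column2": pa.Column(pa.Float),
        "column3": pa.Column(pa.Float),
    }
)


@pa.check_output(out_schema)
def postprocessor(dataframe):
    # a hypothetical downstream step; here it just passes the data through
    return dataframe


postprocessor(preprocessed_df)
```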
Great Expectations#
Great Expectations is another data validation library. It's really geared toward production (and can integrate well with some other production tools). I wouldn't necessarily recommend it if you're doing a research project, but it has some amazing features that earned it a place in this chapter: it doesn't just validate data, it documents and profiles it too.

Their philosophy is that 'Expectations', the checks or assertions about data, are unit tests for your data (don't worry if you're not familiar with unit tests; they are checks for your code!). Once you have run your expectations, the library can create data documentation and data quality reports from them. The data documentation is quite an amazing feature, providing a navigable HTML report of your dataset.
Using Great Expectations is a bit different from pandera as it replaces your dataframe with a Great Expectations `PandasDataset` that looks and feels just like a regular pandas dataframe but has extra methods related to data validation. To convert a regular pandas dataframe, you wrap it with `ge.from_pandas`.
Let’s look at some examples:
import great_expectations as ge
df = ge.from_pandas(
    pd.DataFrame(
        {
            "column1": [1, 4, 0, 10, 9],
            "column2": [-1.3, -1.4, -2.9, -10.1, -20.4],
            "column3": ["value_1", "value_2", "value_3", "value_2", "value_1"],
        }
    )
)
df.head()
|   | column1 | column2 | column3 |
|---|---------|---------|---------|
| 0 | 1       | -1.3    | value_1 |
| 1 | 4       | -1.4    | value_2 |
| 2 | 0       | -2.9    | value_3 |
| 3 | 10      | -10.1   | value_2 |
| 4 | 9       | -20.4   | value_1 |
Now let’s run some expectations:
df.expect_column_values_to_be_of_type("column1", "int")
{
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": "int64"
  }
}
df.expect_column_proportion_of_unique_values_to_be_between("column3", 3, 5)
{
  "success": false,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": 0.6,
    "element_count": 5,
    "missing_count": null,
    "missing_percent": null
  }
}
Note that the first expectation passed, while the second failed but did not throw an error… we were just told in JSON.
Let's also see a less stringent expectation:
df.expect_column_values_to_be_between(
    column="column2",
    min_value=-15,
    max_value=0,
    mostly=0.80,
)
{
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 5,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 1,
    "unexpected_percent": 20.0,
    "unexpected_percent_total": 20.0,
    "unexpected_percent_nonmissing": 20.0,
    "partial_unexpected_list": [
      -20.4
    ]
  }
}
Having a bunch of JSON thrown at you is hardly a pleasant experience, and running a few commands solely in a notebook (which is how this was generated) is not really the way that Great Expectations is intended to be used. Instead, it is designed to be used via several commands in the terminal that set up folders and files that will track the expectations you create and produce nice HTML reports on your data and whether it passed your tests or not. There is a full tutorial available on the website.
Pydantic#
Pydantic has some of the same features as pandera, but it piggybacks on the ability of Python 3.5+ to have 'typed' variables (if you're not sure what that is, it's a way to declare that a variable has a particular data type from inception) and it is really focused on the validation of objects (created from classes) rather than dataframes.
If you’ve used the SQLModel package for writing SQL queries, you may be interested to know that every SQLModel call is also a Pydantic model.
Here’s an example of a Pydantic schema that also implements a class:
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel
class User(BaseModel):
    id: int
    name = "Katherine Johnson"
    signup_ts: Optional[datetime] = None
    friends: List[int] = []
We now pass some (acceptable) external data into this schema, creating an object of type `User`:
external_data = {
    "id": "123",
    "signup_ts": "2019-06-01 12:22",
    "friends": [1, 2, "3"],
}
user = User(**external_data)
user
User(id=123, signup_ts=datetime.datetime(2019, 6, 1, 12, 22), friends=[1, 2, 3], name='Katherine Johnson')
That all worked well (as expected) but what if we now create an object that has ‘bad’ data:
User(signup_ts='broken', friends=[1, 2, 'not number'])
---------------------------------------------------------------------------
ValidationError Traceback (most recent call last)
<ipython-input-15-7a13cdd5b253> in <module>
----> 1 User(signup_ts='broken', friends=[1, 2, 'not number'])
~/opt/anaconda3/envs/codeforecon/lib/python3.8/site-packages/pydantic/main.cpython-38-darwin.so in pydantic.main.BaseModel.__init__()
ValidationError: 3 validation errors for User
id
field required (type=value_error.missing)
signup_ts
invalid datetime format (type=value_error.datetime)
friends -> 2
value is not a valid integer (type=type_error.integer)
We get a loud failure, a `ValidationError`, and a short report about what went wrong.
Cerberus#
Our final stop on this tour of data validation is Cerberus. It is designed to provide simple and lightweight data validation functionality. In this case, the schema is specified as a (JSON-like) dictionary:
schema = {"name": {"type": "string"}, "score": {"type": "integer", "max": 10}}
Then we pass data into a Cerberus validator object along with the schema. This will return `True` for good data, but for bad data…
from cerberus import Validator
v = Validator()
data = {"name": "Sadie Alexander", "score": 12, "occupation": "Economist"}
v.validate(data, schema)
False
This is useful to know but we might want a bit more detail:
v.errors
{'occupation': ['unknown field'], 'score': ['max value is 10']}
Much more complex schemas can be constructed.
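For example, here's a sketch of a slightly more involved schema using some of Cerberus' standard rules (the fields are invented for illustration):

```python
schema = {
    "name": {"type": "string", "required": True},
    "score": {"type": "integer", "min": 0, "max": 10},
    "tags": {"type": "list", "schema": {"type": "string"}},
    "contact": {
        "type": "dict",
        "schema": {"email": {"type": "string", "regex": r"^\S+@\S+$"}},
    },
}

v.validate({"name": "Sadie Alexander", "score": 9, "tags": ["economics"]}, schema)
```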
Fake and Synthetic Data#
Fake data can be very useful for testing pipelines and trying out analyses before you have the real data. The closer fake data are to the real data, the better prepared you will be for the real thing.

Fake data are data generated according to a schema that bear no statistical relationship to the real data. There are powerful tools for generating fake data. One of the most popular and fastest libraries for doing so is Mimesis.
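As a quick sketch of generating fake data with Mimesis (the field names here are our own choice):

```python
from mimesis import Person

person = Person()  # defaults to the English locale

fake_people = pd.DataFrame(
    {
        "name": [person.full_name() for _ in range(5)],
        "email": [person.email() for _ in range(5)],
    }
)
fake_people
```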
Synthetic data take this one step further: they do capture some of the statistical properties of the underlying data and are generated from the underlying data. Again, they can be useful for trying out data before using the real thing, especially if the real data are highly confidential. A useful package for comparing real and synthetic data is SynthGauge.
Data Orchestration#
Data orchestration is the automation of data-driven processes from end-to-end, including preparing data, making decisions based on that data, and taking actions based on those decisions. Think of a data pipeline in which you extract data (perhaps from files), transform it somehow, and load it into where you want to put it (downstream in your research process, or perhaps into an app).
Going into details of data orchestration is outside of the scope of this book, so we won’t be seeing a practical example, but we think it’s important enough to mention it and to point you to more resources.
There are some truly amazing tools out there to help you do this on a production scale. Perhaps the best known and most suited to production are Airbnb's Airflow and Spotify's Luigi. Airflow in particular is widely used in the tech industry, and it doesn't just schedule data processes in Python: it can run processes in pretty much whatever you like. Both of these tools try to solve the 'plumbing' associated with long-running batch processes on data: chaining tasks, automating them, dealing with failures, and scheduling. Both Luigi and Airflow have fancy interfaces to show you what's going on with your tasks.
Data orchestration tools typically have a directed acyclic graph, a DAG, at the heart of how tasks are strung together. This defines how different tasks depend on each other in order, with one task following from the previous one (or perhaps following from multiple previous tasks). It’s an automation dream.
However, for a research project, it's hard to recommend these two tools as they're just a bit too powerful; Airflow in particular can do just about anything but has a steep learning curve. So, instead, to show the power of data orchestration we'll use a more lightweight but still very powerful library: dagster, which bills itself as 'a data orchestrator for machine learning, analytics, and ETL [extract, transform, and load]'. Some of its key features are the ability to implement components in various tools (such as pandas and SQL), to define pipelines as DAGs, and to test the same setup on your own machine that you then deploy to the cloud. Like the other tools, it has a nice visual interface to show what's happening.
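To give a flavour of the DAG idea, here's a minimal sketch using dagster's `op` and `job` decorators (the function names and data are invented, and the details may differ across dagster versions):

```python
from dagster import job, op


@op
def extract():
    # stand-in for reading data from a file or an API
    return [1, 2, 3]


@op
def transform(data):
    return [x * 2 for x in data]


@op
def load(data):
    print(f"loaded {data}")


@job
def etl_pipeline():
    # wiring the ops together like function calls defines the DAG
    load(transform(extract()))


# run the whole pipeline in-process
result = etl_pipeline.execute_in_process()
```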