Features

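Skimpy provides functions to skim a dataframe and return the statistics, clean up messy dataframe column names, and export the visual summary table to SVG, plus a command line utility for skimming CSV files.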

When using skimpy, be aware that numerical columns are rounded to 4 significant figures and that any timezone-aware datetimes are converted into their naive equivalents.
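
To make these two caveats concrete, here is a minimal standard-library sketch of what "4 significant figures" and "naive equivalent" mean. It is an illustration only, not skimpy's implementation: round_sig is a hypothetical helper, and this page does not say whether skimpy keeps the local wall-clock time or shifts to UTC before dropping the timezone, so both variants are shown.

```python
from datetime import datetime, timedelta, timezone


def round_sig(x: float, sig: int = 4) -> float:
    """Round x to `sig` significant figures (hypothetical helper, not part of skimpy)."""
    return float(f"{x:.{sig}g}")


# Significant figures count digits from the first non-zero digit,
# so small and large values keep the same relative precision.
rounded = [round_sig(v) for v in (0.501623, 13.9123, 1.5729e-06)]

# A timezone-aware datetime at UTC+02:00 ...
aware = datetime(2023, 3, 4, 12, 0, tzinfo=timezone(timedelta(hours=2)))

# ... and two possible naive equivalents (assumption: either could apply).
naive_wall_clock = aware.replace(tzinfo=None)                     # keep local clock time
naive_utc = aware.astimezone(timezone.utc).replace(tzinfo=None)  # shift to UTC first
```

If you need full precision or timezone information in your own analysis, compute those statistics from the original dataframe rather than from the skimmed summary.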

You can find a full guide to the API on the reference pages.

Skim a dataframe and return the statistics

To use skim() in its default mode, see the quickstart on the homepage.

If you want to export your results to a dictionary within Python, rather than printing them to the console, use the skim_get_data() function instead. Let’s see an example:

import pandas as pd
from rich import print
from skimpy import generate_test_data, skim_get_data

df = generate_test_data()

summary = skim_get_data(df)

The dictionary has the following contents:

print(summary)
{
    'Data Summary': {'Number of rows': 1000, 'Number of columns': 13},
    'Data Types': {
        'float64': 3,
        'category': 2,
        'object': 2,
        'datetime64[ns]': 2,
        'bool': 1,
        'int64': 1,
        'string': 1,
        'timedelta64[ns]': 1
    },
    'Categories': {'Columns': {'class', 'location'}},
    'number': {
        'NA': {'length': 0, 'width': 0, 'depth': 0, 'rnd': 118},
        'NA %': {'length': 0.0, 'width': 0.0, 'depth': 0.0, 'rnd': 11.8},
        'mean': {'length': 0.5016, 'width': 2.037, 'depth': 10.02, 'rnd': -0.01977},
        'sd': {'length': 0.3597, 'width': 1.929, 'depth': 3.208, 'rnd': 1.002},
        'p0': {'length': 1.573e-06, 'width': 0.002057, 'depth': 2.0, 'rnd': -2.809},
        'p25': {'length': 0.134, 'width': 0.603, 'depth': 8.0, 'rnd': -0.7355},
        'p50': {'length': 0.4976, 'width': 1.468, 'depth': 10.0, 'rnd': -0.0007736},
        'p75': {'length': 0.8602, 'width': 2.953, 'depth': 12.0, 'rnd': 0.6639},
        'p100': {'length': 1.0, 'width': 13.91, 'depth': 20.0, 'rnd': 3.717},
        'hist': {'length': '█▃▃▃▄█', 'width': '█▃▁   ', 'depth': '▁▄█▆▃▁', 'rnd': '▁▄█▅▁ '}
    },
    'category': {
        'NA': {'class': 0, 'location': 1},
        'NA %': {'class': 0.0, 'location': 0.1},
        'ordered': {'class': False, 'location': False},
        'unique': {'class': 2, 'location': 5}
    },
    'bool': {'true': {'booly_col': 516}, 'true rate': {'booly_col': 0.52}, 'hist': {'booly_col': '█    █'}},
    'datetime': {
        'NA': {'datetime': 0, 'datetime_no_freq': 3},
        'NA %': {'datetime': 0.0, 'datetime_no_freq': 0.3},
        'first': {
            'datetime': Timestamp('2018-01-31 00:00:00'),
            'datetime_no_freq': Timestamp('1992-01-05 00:00:00')
        },
        'last': {
            'datetime': Timestamp('2101-04-30 00:00:00'),
            'datetime_no_freq': Timestamp('2023-03-04 00:00:00')
        },
        'frequency': {'datetime': 'ME', 'datetime_no_freq': None}
    },
    "<class 'datetime.date'>": {
        'NA': {'datetime.date': 0, 'datetime.date_no_freq': 0},
        'NA %': {'datetime.date': 0.0, 'datetime.date_no_freq': 0.0},
        'first': {
            'datetime.date': datetime.date(2018, 1, 31),
            'datetime.date_no_freq': datetime.date(1992, 1, 5)
        },
        'last': {'datetime.date': datetime.date(2101, 4, 30), 'datetime.date_no_freq': datetime.date(2023, 3, 4)},
        'frequency': {'datetime.date': 'ME', 'datetime.date_no_freq': None}
    },
    'timedelta64[ns]': {
        'NA': {'time diff': 5},
        'NA %': {'time diff': 0.5},
        'mean': {'time diff': Timedelta('8 days 00:05:47')},
        'median': {'time diff': Timedelta('0 days 00:00:00')},
        'max': {'time diff': Timedelta('26 days 00:00:00')}
    },
    'string': {
        'NA': {'text': 6},
        'NA %': {'text': 0.6},
        'shortest': {'text': 'How are you?'},
        'longest': {'text': 'Indeed, it was the most outrageously pompous cat I have ever seen.'},
        'min': {'text': 'How are you?'},
        'max': {'text': 'What weather!'},
        'chars per row': {'text': 31.1},
        'words per row': {'text': 5.8},
        'total words': {'text': 5761}
    },
    'object': {
        'NA': {'datetime.date': 0, 'datetime.date_no_freq': 0},
        'NA %': {'datetime.date': 0.0, 'datetime.date_no_freq': 0.0}
    }
}
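
The output above is nested as section, then statistic, then column name. The sketch below shows one way to navigate that layout. To stay runnable without skimpy installed, it uses a small excerpt of the printed dictionary copied by hand (summary_excerpt is a stand-in for the real summary object), and the pivot into a per-column view is just one possible way to reshape it.

```python
# Hand-copied excerpt of the dictionary printed above (stand-in for `summary`).
summary_excerpt = {
    "Data Summary": {"Number of rows": 1000, "Number of columns": 13},
    "number": {
        "NA": {"length": 0, "width": 0, "depth": 0, "rnd": 118},
        "mean": {"length": 0.5016, "width": 2.037, "depth": 10.02, "rnd": -0.01977},
    },
}

# Direct lookups follow the section -> statistic -> column nesting.
n_rows = summary_excerpt["Data Summary"]["Number of rows"]
mean_length = summary_excerpt["number"]["mean"]["length"]

# Pivot statistic -> column -> value into column -> statistic -> value.
by_column = {}
for stat, per_column in summary_excerpt["number"].items():
    for column, value in per_column.items():
        by_column.setdefault(column, {})[stat] = value

# Example query: which numeric columns have missing values?
columns_with_missing = [c for c, stats in by_column.items() if stats["NA"] > 0]
```

The same lookups should apply to the full dictionary returned by skim_get_data(), since the excerpt keeps its keys unchanged.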

Clean up messy dataframe column names

skimpy also comes with a clean_columns() function as a convenience (with thanks to the dataprep package). It slugifies column names in pandas dataframes. For example:

from skimpy import clean_columns

columns = [
    "bs lncs;n edbn ",
    "Nín hǎo. Wǒ shì zhōng guó rén",
    "___This is a test___",
    "ÜBER Über German Umlaut",
]
messy_df = pd.DataFrame(columns=columns, index=[0], data=[range(len(columns))])
print("Column names:")
print(list(messy_df.columns))
Column names:
['bs lncs;n edbn ', 'Nín hǎo. Wǒ shì zhōng guó rén', '___This is a test___', 'ÜBER Über German Umlaut']

Now let’s clean these. By default, the names we get back are in snake case:

clean_df = clean_columns(messy_df)
print(list(clean_df.columns))
['bs_lncs_n_edbn', 'nin_hao_wo_shi_zhong_guo_ren', 'this_is_a_test', 'uber_uber_german_umlaut']

Other naming conventions are available, for example camel case:

clean_df = clean_columns(messy_df, case="camel")
print(list(clean_df.columns))
['bsLncsNEdbn', 'ninHaoWoShiZhongGuoRen', 'thisIsATest', 'uberUberGermanUmlaut']
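
For readers wondering what "slugifying" involves, here is a rough standard-library sketch that reproduces the snake case and camel case results above for these four names. It is not the clean_columns() implementation: to_snake and to_camel are hypothetical helpers, and the real function may treat other inputs differently.

```python
import re
import unicodedata


def to_snake(name: str) -> str:
    """Hypothetical helper: strip accents, then join alphanumeric runs with '_'."""
    # NFKD splits accented letters into a base letter plus combining marks;
    # encoding to ASCII with errors="ignore" then drops the marks.
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    # Collapse every run of non-alphanumeric characters into a single underscore.
    return re.sub(r"[^a-z0-9]+", "_", ascii_name.lower()).strip("_")


def to_camel(name: str) -> str:
    """Hypothetical helper: snake case first, then capitalise all but the first word."""
    first, *rest = to_snake(name).split("_")
    return first + "".join(word.capitalize() for word in rest)


columns = [
    "bs lncs;n edbn ",
    "Nín hǎo. Wǒ shì zhōng guó rén",
    "___This is a test___",
    "ÜBER Über German Umlaut",
]
snake_names = [to_snake(c) for c in columns]
camel_names = [to_camel(c) for c in columns]
```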

Export the visual summary table to SVG

To export the figure containing the table of summary statistics, use the skim_get_figure() function. This will save an SVG file to the given (relative) path that you pass with the save_path argument.

Run skim on a CSV file from the command line

Although it’s usually better to set datatypes before running skimpy on data, we provide a command line utility that can work with CSV files as a convenience.

You can run it as shown below. Note that the command is skimpy, the name of the package, rather than skim, the name of the Python function.

$ skimpy file.csv
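
The advice about setting datatypes first comes down to the fact that a CSV file stores only text. The standard-library sketch below writes a few typed values to a temporary file.csv and reads them back. The column names are borrowed from the test data above, the values are made up for illustration, and nothing here calls skimpy itself.

```python
import csv
import tempfile
from datetime import date
from pathlib import Path

# Illustrative rows holding a float, a bool, and a date.
rows = [
    {"length": 0.5016, "booly_col": True, "datetime": date(2018, 1, 31)},
    {"length": 2.037, "booly_col": False, "datetime": date(2018, 2, 28)},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "file.csv"

    # Write the rows; the csv module converts every value with str().
    with csv_path.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

    # Read them back: the original types are gone and every value is a string.
    with csv_path.open(newline="") as f:
        read_back = list(csv.DictReader(f))

value_types = {type(v).__name__ for row in read_back for v in row.values()}
```

Any tool that reads the file afterwards, the command line utility included, has to infer types from that text, which is why skimming a dataframe whose datatypes you have set yourself is usually more reliable.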