Features

Skimpy provides:

a way to create summary statistics of pandas or Polars dataframes, using the skim() function, and print them to your console via the rich package
support for summarising boolean, numeric, datetime, timedelta, string, and category datatypes
a command line interface to skim CSV files
intelligent rounding of numerical values to 4 significant figures
a way to export the visual summary statistics to lossless formats, namely SVG or HTML
a way to further work with the summary statistics, by returning them as a dictionary
a way to clean up messy column names in both pandas and Polars dataframes

When using skimpy, please be aware that numerical columns are rounded to 4 significant figures. You should also be aware that any timezone-aware datetimes are converted into their naive equivalents.
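To make the rounding concrete, here is a minimal sketch of what rounding to 4 significant figures means. It is not skimpy's own implementation: the helper name round_sig and the input values are made up for illustration, and the rounding is done with Python's general number format.

```python
def round_sig(value: float, sig: int = 4) -> float:
    """Round value to `sig` significant figures (illustrative helper, not part of skimpy)."""
    # The "g" format keeps `sig` significant digits and switches to
    # scientific notation for very small or very large numbers.
    return float(f"{value:.{sig}g}")


# Made-up inputs chosen to show different magnitudes.
for raw in (0.501632, 13.90817, 1.5729e-06, 123456):
    print(raw, "->", round_sig(raw))
```

The number of significant figures stays fixed regardless of magnitude, so large values lose their trailing digits while small values keep their leading non-zero digits.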

You can find a full guide to the API on the reference pages.

Skim a dataframe and return the statistics

To use skim() in its default mode, see the quickstart on the homepage.

If you want to export your results to a dictionary within Python, rather than printing them to the console, use the skim_get_data() function instead. Let’s see an example:

import pandas as pd
from rich import print
from skimpy import generate_test_data, skim_get_data

df = generate_test_data()

summary = skim_get_data(df)

The dictionary has the following contents:

print(summary)

{
    'Data Summary': {'Number of rows': 1000, 'Number of columns': 13},
    'Data Types': {
        'float64': 3,
        'category': 2,
        'object': 2,
        'datetime64[ns]': 2,
        'bool': 1,
        'int64': 1,
        'string': 1,
        'timedelta64[ns]': 1
    },
    'Categories': {'Columns': {'class', 'location'}},
    'number': {
        'NA': {'length': 0, 'width': 0, 'depth': 0, 'rnd': 118},
        'NA %': {'length': 0.0, 'width': 0.0, 'depth': 0.0, 'rnd': 11.8},
        'mean': {'length': 0.5016, 'width': 2.037, 'depth': 10.02, 'rnd': -0.01977},
        'sd': {'length': 0.3597, 'width': 1.929, 'depth': 3.208, 'rnd': 1.002},
        'p0': {'length': 1.573e-06, 'width': 0.002057, 'depth': 2.0, 'rnd': -2.809},
        'p25': {'length': 0.134, 'width': 0.603, 'depth': 8.0, 'rnd': -0.7355},
        'p50': {'length': 0.4976, 'width': 1.468, 'depth': 10.0, 'rnd': -0.0007736},
        'p75': {'length': 0.8602, 'width': 2.953, 'depth': 12.0, 'rnd': 0.6639},
        'p100': {'length': 1.0, 'width': 13.91, 'depth': 20.0, 'rnd': 3.717},
        'hist': {'length': '█▃▃▃▄█', 'width': '█▃▁   ', 'depth': '▁▄█▆▃▁', 'rnd': '▁▄█▅▁ '}
    },
    'category': {
        'NA': {'class': 0, 'location': 1},
        'NA %': {'class': 0.0, 'location': 0.1},
        'ordered': {'class': False, 'location': False},
        'unique': {'class': 2, 'location': 5}
    },
    'bool': {'true': {'booly_col': 516}, 'true rate': {'booly_col': 0.52}, 'hist': {'booly_col': '█    █'}},
    'datetime': {
        'NA': {'datetime': 0, 'datetime_no_freq': 3},
        'NA %': {'datetime': 0.0, 'datetime_no_freq': 0.3},
        'first': {
            'datetime': Timestamp('2018-01-31 00:00:00'),
            'datetime_no_freq': Timestamp('1992-01-05 00:00:00')
        },
        'last': {
            'datetime': Timestamp('2101-04-30 00:00:00'),
            'datetime_no_freq': Timestamp('2023-03-04 00:00:00')
        },
        'frequency': {'datetime': 'ME', 'datetime_no_freq': None}
    },
    "<class 'datetime.date'>": {
        'NA': {'datetime.date': 0, 'datetime.date_no_freq': 0},
        'NA %': {'datetime.date': 0.0, 'datetime.date_no_freq': 0.0},
        'first': {
            'datetime.date': datetime.date(2018, 1, 31),
            'datetime.date_no_freq': datetime.date(1992, 1, 5)
        },
        'last': {'datetime.date': datetime.date(2101, 4, 30), 'datetime.date_no_freq': datetime.date(2023, 3, 4)},
        'frequency': {'datetime.date': 'ME', 'datetime.date_no_freq': None}
    },
    'timedelta64[ns]': {
        'NA': {'time diff': 5},
        'NA %': {'time diff': 0.5},
        'mean': {'time diff': Timedelta('8 days 00:05:47')},
        'median': {'time diff': Timedelta('0 days 00:00:00')},
        'max': {'time diff': Timedelta('26 days 00:00:00')}
    },
    'string': {
        'NA': {'text': 6},
        'NA %': {'text': 0.6},
        'shortest': {'text': 'How are you?'},
        'longest': {'text': 'Indeed, it was the most outrageously pompous cat I have ever seen.'},
        'min': {'text': 'How are you?'},
        'max': {'text': 'What weather!'},
        'chars per row': {'text': 31.1},
        'words per row': {'text': 5.8},
        'total words': {'text': 5761}
    },
    'object': {
        'NA': {'datetime.date': 0, 'datetime.date_no_freq': 0},
        'NA %': {'datetime.date': 0.0, 'datetime.date_no_freq': 0.0}
    }
}
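Because the result is a plain nested dictionary, you can work with it using ordinary Python. As a small sketch, the snippet below collects every column with missing values from the 'NA %' entries. To keep it runnable on its own it uses a hand-copied excerpt of the output above; in practice you would pass the dictionary returned by skim_get_data(). The helper name columns_with_missing is hypothetical.

```python
# Excerpt of the dictionary printed above (copied by hand for illustration).
summary = {
    "Data Summary": {"Number of rows": 1000, "Number of columns": 13},
    "number": {
        "NA %": {"length": 0.0, "width": 0.0, "depth": 0.0, "rnd": 11.8},
        "mean": {"length": 0.5016, "width": 2.037, "depth": 10.02, "rnd": -0.01977},
    },
    "category": {"NA %": {"class": 0.0, "location": 0.1}},
    "string": {"NA %": {"text": 0.6}},
}


def columns_with_missing(summary: dict) -> dict:
    """Map column name to its 'NA %' value, for columns where it is above zero."""
    result = {}
    for stats in summary.values():
        # Sections without an 'NA %' entry (e.g. 'Data Summary') are skipped.
        for column, pct in stats.get("NA %", {}).items():
            if pct > 0:
                result[column] = pct
    return result


missing = columns_with_missing(summary)
print(missing)
```

The same pattern works for any other statistic: pick the section by datatype, then the statistic, then the column name.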

Clean up messy dataframe column names

skimpy also comes with a clean_columns function as a convenience (with thanks to the dataprep package). This slugifies column names in pandas or Polars dataframes. For example,

from skimpy import clean_columns

columns = [
    "bs lncs;n edbn ",
    "Nín hǎo. Wǒ shì zhōng guó rén",
    "___This is a test___",
    "ÜBER Über German Umlaut",
]
messy_df = pd.DataFrame(columns=columns, index=[0], data=[range(len(columns))])
print("Column names:")
print(list(messy_df.columns))

Column names:

['bs lncs;n edbn ', 'Nín hǎo. Wǒ shì zhōng guó rén', '___This is a test___', 'ÜBER Über German Umlaut']

Now let’s clean these. By default, what we get back is in snake case:

clean_df = clean_columns(messy_df)
print(list(clean_df.columns))

['bs_lncs_n_edbn', 'nin_hao_wo_shi_zhong_guo_ren', 'this_is_a_test', 'uber_uber_german_umlaut']

Other naming conventions are available, for example camel case:

clean_df = clean_columns(messy_df, case="camel")
print(list(clean_df.columns))

['bsLncsNEdbn', 'ninHaoWoShiZhongGuoRen', 'thisIsATest', 'uberUberGermanUmlaut']
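For intuition about what slugifying a column name involves, here is a rough standard-library sketch of a snake case conversion. It is an approximation written for this guide, not the code that clean_columns actually runs, and the helper name to_snake is hypothetical. It will not cover everything the real function handles.

```python
import re
import unicodedata


def to_snake(name: str) -> str:
    """Approximate snake-case slug of a column name (illustrative only)."""
    # Split accented characters into base letter + combining mark,
    # then drop anything that is not plain ASCII.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Replace each run of non-alphanumeric characters with one underscore.
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name)
    # Trim leading/trailing underscores and lowercase the result.
    return slug.strip("_").lower()


columns = [
    "bs lncs;n edbn ",
    "Nín hǎo. Wǒ shì zhōng guó rén",
    "___This is a test___",
    "ÜBER Über German Umlaut",
]
print([to_snake(c) for c in columns])
```

For these four names the sketch gives the same snake case names as shown earlier; for real data, prefer clean_columns.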

Export the visual summary table to SVG

To export the figure containing the table of summary statistics, use the skim_get_figure() function. This will save an SVG file to the given (relative) path that you pass with the save_path argument.

Run skim on a CSV file from the command line

Although it’s usually better to set datatypes before running skimpy on data, we provide a command line utility that can work with CSV files as a convenience.

You can run it as shown below. Note that the command is skimpy, the name of the package, rather than skim, the name of the Python function.

$ skimpy file.csv
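If you would rather set datatypes first, as recommended above, one option is to load the CSV in Python with explicit types and then skim the resulting dataframe. The sketch below uses pandas and a small in-memory CSV; the column names and values are invented for illustration.

```python
import io

import pandas as pd

# A tiny made-up CSV, held in memory so the example is self-contained.
csv_text = (
    "city,visited,score,when\n"
    "Leeds,True,1.5,2024-01-31\n"
    "York,False,2.5,2024-02-29\n"
)

# Declare datatypes up front instead of relying on inference.
df = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"city": "category", "visited": "bool", "score": "float64"},
    parse_dates=["when"],
)
print(df.dtypes)

# The typed dataframe can then be passed to skim(), as in the quickstart.
```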