import pandas as pd
from rich import print
from skimpy import generate_test_data, skim_get_data
df = generate_test_data()
summary = skim_get_data(df)
Features
Skimpy provides:
- a way to create summary statistics of pandas or Polars dataframes, using the skim() function, and print them to your console via the rich package
- support for summarising boolean, numeric, datetime, timedelta, string, and category datatypes
- a command line interface to skim csv files
- intelligent rounding of numerical values to 4 significant figures
- a way to export the visual summary statistics to lossless formats, namely SVG or HTML
- a way to further work with the summary statistics, by returning them as a dictionary
- a way to clean up messy column names in both pandas and Polars dataframes
When using skimpy, please be aware that numerical columns are rounded to 4 significant figures.
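To make the rounding concrete, here is a minimal sketch of rounding to 4 significant figures in plain Python. The helper round_sig is hypothetical and written only for illustration; it is not skimpy's own rounding routine, which may differ in edge cases.

```python
from math import floor, log10


def round_sig(value: float, sig: int = 4) -> float:
    """Round value to `sig` significant figures (illustrative helper, not part of skimpy)."""
    if value == 0:
        return 0.0
    # Position of the leading digit relative to the decimal point.
    magnitude = floor(log10(abs(value)))
    # Keep `sig` digits counted from the leading digit.
    return round(value, sig - 1 - magnitude)


# Hypothetical raw values of different magnitudes.
for raw in (13.9123, 0.00205749, -0.019768, 5761.4):
    print(raw, "->", round_sig(raw))
```

Note that the number of decimal places kept depends on the magnitude of each value, so small values keep more decimals than large ones.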
You can find a full guide to the API on the reference pages.
Skim a dataframe and return the statistics
To use skim() in its default mode, see the quickstart on the homepage.
If you want to export your results to a dictionary within Python, rather than printing them to console, use the skim_get_data() function instead, as in the snippet at the top of this page, which stores the result in summary.
And the dictionary has contents as follows:
print(summary)
{
    'Data Summary': {'Number of rows': 1000, 'Number of columns': 13},
    'Data Types': {
        'float64': 3,
        'category': 2,
        'datetime64[ns]': 2,
        'object': 2,
        'int64': 1,
        'bool': 1,
        'string': 1,
        'timedelta64[ns]': 1
    },
    'Categories': {'Columns': {'location', 'class'}},
    'number': {
        'NA': {'length': 0, 'width': 0, 'depth': 0, 'rnd': 118},
        'NA %': {'length': 0.0, 'width': 0.0, 'depth': 0.0, 'rnd': 11.8},
        'mean': {'length': 0.5016, 'width': 2.037, 'depth': 10.02, 'rnd': -0.01977},
        'sd': {'length': 0.3597, 'width': 1.929, 'depth': 3.208, 'rnd': 1.002},
        'p0': {'length': 1.573e-06, 'width': 0.002057, 'depth': 2.0, 'rnd': -2.809},
        'p25': {'length': 0.134, 'width': 0.603, 'depth': 8.0, 'rnd': -0.7355},
        'p50': {'length': 0.4976, 'width': 1.468, 'depth': 10.0, 'rnd': -0.0007736},
        'p75': {'length': 0.8602, 'width': 2.953, 'depth': 12.0, 'rnd': 0.6639},
        'p100': {'length': 1.0, 'width': 13.91, 'depth': 20.0, 'rnd': 3.717},
        'hist': {'length': '▇▃▃▃▅▇', 'width': '▇▃▁ ', 'depth': '▁▃▇▆▃▁', 'rnd': '▁▅▇▅▁ '}
    },
    'category': {
        'NA': {'class': 0, 'location': 1},
        'NA %': {'class': 0.0, 'location': 0.1},
        'ordered': {'class': False, 'location': False},
        'unique': {'class': 2, 'location': 5}
    },
    'bool': {'true': {'booly_col': 516}, 'true rate': {'booly_col': 0.52}, 'hist': {'booly_col': '▇ ▇'}},
    'datetime': {
        'NA': {'datetime': 0, 'datetime_no_freq': 3},
        'NA %': {'datetime': 0.0, 'datetime_no_freq': 0.3},
        'first': {
            'datetime': Timestamp('2018-01-31 00:00:00'),
            'datetime_no_freq': Timestamp('1992-01-05 00:00:00')
        },
        'last': {
            'datetime': Timestamp('2101-04-30 00:00:00'),
            'datetime_no_freq': Timestamp('2023-03-04 00:00:00')
        },
        'frequency': {'datetime': 'ME', 'datetime_no_freq': None}
    },
    "<class 'datetime.date'>": {
        'NA': {'datetime.date': 0, 'datetime.date_no_freq': 0},
        'NA %': {'datetime.date': 0.0, 'datetime.date_no_freq': 0.0},
        'first': {
            'datetime.date': datetime.date(2018, 1, 31),
            'datetime.date_no_freq': datetime.date(1992, 1, 5)
        },
        'last': {'datetime.date': datetime.date(2101, 4, 30), 'datetime.date_no_freq': datetime.date(2023, 3, 4)},
        'frequency': {'datetime.date': 'ME', 'datetime.date_no_freq': None}
    },
    'timedelta64[ns]': {
        'NA': {'time diff': 5},
        'NA %': {'time diff': 0.5},
        'mean': {'time diff': Timedelta('8 days 00:05:47')},
        'median': {'time diff': Timedelta('0 days 00:00:00')},
        'max': {'time diff': Timedelta('26 days 00:00:00')}
    },
    'string': {
        'NA': {'text': 6},
        'NA %': {'text': 0.6},
        'words per row': {'text': 5.8},
        'total words': {'text': 5761}
    }
}
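Because the result is an ordinary nested dictionary, it can be queried with plain Python. As a rough sketch, the snippet below collects every column that has missing values. The helper columns_with_missing is hypothetical (not part of skimpy), and summary_excerpt is a hand-copied subset of the output above so that the example runs without skimpy installed; it assumes each datatype section stores its missing-value counts under the 'NA' key, as in that output.

```python
# Hand-copied subset of the summary dictionary shown above.
summary_excerpt = {
    "Data Summary": {"Number of rows": 1000, "Number of columns": 13},
    "number": {
        "NA": {"length": 0, "width": 0, "depth": 0, "rnd": 118},
        "NA %": {"length": 0.0, "width": 0.0, "depth": 0.0, "rnd": 11.8},
    },
    "category": {
        "NA": {"class": 0, "location": 1},
        "NA %": {"class": 0.0, "location": 0.1},
    },
    "string": {"NA": {"text": 6}, "NA %": {"text": 0.6}},
}


def columns_with_missing(summary: dict) -> dict:
    """Map column name -> missing-value count for columns with at least one NA."""
    missing = {}
    for section in summary.values():
        # Sections without an 'NA' entry (e.g. 'Data Summary') are skipped.
        na_counts = section.get("NA") if isinstance(section, dict) else None
        if not isinstance(na_counts, dict):
            continue
        for column, count in na_counts.items():
            if count > 0:
                missing[column] = count
    return missing


missing = columns_with_missing(summary_excerpt)
print(missing)
```

The same function should work on the full dictionary returned by skim_get_data(), since sections that lack an 'NA' entry are ignored.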
Clean up messy dataframe column names
skimpy also comes with a clean_columns function as a convenience (with thanks to the dataprep package). This slugifies column names in pandas dataframes. For example,
from skimpy import clean_columns
columns = [
    "bs lncs;n edbn ",
    "Nín hǎo. Wǒ shì zhōng guó rén",
    "___This is a test___",
    "ÜBER Über German Umlaut",
]
messy_df = pd.DataFrame(columns=columns, index=[0], data=[range(len(columns))])
print("Column names:")
print(list(messy_df.columns))
Column names:
['bs lncs;n edbn ', 'Nín hǎo. Wǒ shì zhōng guó rén', '___This is a test___', 'ÜBER Über German Umlaut']
Now let’s clean these. By default, what we get back is in snake case:
clean_df = clean_columns(messy_df)
print(list(clean_df.columns))
['bs_lncs_n_edbn', 'nin_hao_wo_shi_zhong_guo_ren', 'this_is_a_test', 'uber_uber_german_umlaut']
Other naming conventions are available, for example camel case:
clean_df = clean_columns(messy_df, case="camel")
print(list(clean_df.columns))
['bsLncsNEdbn', 'ninHaoWoShiZhongGuoRen', 'thisIsATest', 'uberUberGermanUmlaut']
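To give a feel for what slugifying involves, here is a rough standard-library approximation of the snake case and camel case conversions. The functions to_snake and to_camel are hypothetical illustrations, not how clean_columns is implemented, and they will not match it for every input (they simply drop characters that have no ASCII decomposition, for instance). For the four column names above they give the same results as shown.

```python
import re
import unicodedata


def to_snake(name: str) -> str:
    """Approximate snake_case slug of a column name (illustrative only)."""
    # Decompose accented characters, then drop the non-ASCII combining marks.
    ascii_name = (
        unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    )
    # Collapse each run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^0-9a-zA-Z]+", "_", ascii_name)
    return slug.strip("_").lower()


def to_camel(name: str) -> str:
    """Approximate camelCase slug, built on top of to_snake."""
    first, *rest = to_snake(name).split("_")
    return first + "".join(word.capitalize() for word in rest)


messy = [
    "bs lncs;n edbn ",
    "Nín hǎo. Wǒ shì zhōng guó rén",
    "___This is a test___",
    "ÜBER Über German Umlaut",
]
print([to_snake(name) for name in messy])
print([to_camel(name) for name in messy])
```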
Export the visual summary table to SVG
To export the figure containing the table of summary statistics, use the skim_get_figure() function. This will save an SVG file to the given (relative) path that you pass with the save_path argument.
Run skim on a csv file from the command line
Although it’s usually better to set datatypes before running skimpy on data, we provide a command line utility that can work with CSV files as a convenience.
You can run it as shown below. Note that the command is skimpy, the name of the package, rather than skim, as in the Python function.
$ skimpy file.csv