8. Data Import#

8.1. Introduction#

In this chapter, we’re going to introduce working with your own data. Here, you’ll learn how to read plain-text rectangular files into Python. We’ll only scratch the surface of data import, but many of the principles will translate to other forms of data. We’ll finish with a few pointers to opening other types of data.

8.1.1. Prerequisites#

You will need the pandas package for this chapter. To check that you have it installed, and to import it into your session, run

import pandas as pd

If this command fails, you don’t have pandas installed. Open up the terminal in Visual Studio Code (Terminal -> New Terminal) and type in conda install pandas. Note that once pandas is installed, the convention is to import it into your Python session under the name pd by putting import pandas as pd at the top of your script.

8.2. Getting Started#

There is a huge range of input and output formats available in pandas: Stata (.dta), Excel (.xls, .xlsx), csv, tsv, big data formats (HDF5, parquet), JSON, SAS, SPSS, SQL, and more; there’s a full list of formats available in the documentation.

[Image: the range of pandas input and output functions, from the pandas documentation]

While pandas has a huge number of ways to read data in and load it into your Python session, here we’ll focus on the humble plain-text table file; for example csv (comma separated values) and tsv (tab separated values).

8.2.1. Reading data from a file#

All of the power needed to open plain-text table files is contained in a single function, pd.read_csv(). It takes numerous arguments but the two most important are the (unnamed) first one, which gives the path to the data, and sep= (a keyword argument) that tells pandas whether values are separated by commas, tabs, or another character; if you don’t specify it, pandas assumes commas. To see the full set of arguments, run help(pd.read_csv).
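
For example, if you had a file whose values were separated by semicolons (common in locales where the comma is the decimal mark), you could pass the separator explicitly. A minimal sketch, with a hypothetical file name:

# hypothetical semicolon-separated file
df = pd.read_csv("data/example.csv", sep=";")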

Here is what a simple CSV file with a row for column names (also commonly referred to as the header row) and six rows of data looks like (using the terminal):

! cat data/students.csv
Student ID,Full Name,favourite.food,mealPlan,AGE
1,Sunil Huffmann,Strawberry yoghurt,Lunch only,4
2,Barclay Lynn,French fries,Lunch only,5
3,Jayendra Lyne,N/A,Breakfast and lunch,7
4,Leon Rossini,Anchovies,Lunch only,8
5,Chidiegwu Dunkel,Pizza,Breakfast and lunch,five
6,Güvenç Attila,Ice cream,Lunch only,6

Note that this is a CSV file, so the values are separated by commas. Now let’s load this into a pandas dataframe in Python:

students = pd.read_csv("data/students.csv")
students
   Student ID         Full Name      favourite.food             mealPlan   AGE
0           1    Sunil Huffmann  Strawberry yoghurt           Lunch only     4
1           2      Barclay Lynn        French fries           Lunch only     5
2           3     Jayendra Lyne                 NaN  Breakfast and lunch     7
3           4      Leon Rossini           Anchovies           Lunch only     8
4           5  Chidiegwu Dunkel               Pizza  Breakfast and lunch  five
5           6     Güvenç Attila           Ice cream           Lunch only     6

If you want to download this data to try it for yourself, head to this link and then right-click on ‘Raw’ and select “Save Link As…”. Save the data in a directory called ‘data’ with the name ‘students.csv’. This directory should sit within your active Visual Studio Code folder. You can check which folder you’re currently in by running the following code:

import os

os.getcwd()  # get current working directory (cwd)

Say this comes back with ‘python4DS’, then your downloaded data should be in ‘python4DS/data/students.csv’.

The first argument to read_csv() was the path to the data; because we didn’t specify a separator, pandas used its default, the comma.

The read_csv() function automatically creates a new index (which is just the position of each row) and takes the top line of data as the header, i.e. the column names. But you may wish to tweak this behaviour in several ways:

  1. Sometimes there are a few lines of metadata at the top of the file. You can use skiprows=n to skip the first n lines, e.g. pd.read_csv("data/students.csv", skiprows=2).

  2. The data might not have column names. You can pass a list to the names= keyword argument to tell read_csv() what column names to use. For example, pd.read_csv("data/students.csv", names=range(5)) would put the numbers 0 to 4 in as the column names.

  3. You may wish to change which column is used as the index. The default behaviour is to create a new index, but for this data we can see there is already an ID column we could use. To do this, use the index_col= argument, for example pd.read_csv("data/students.csv", index_col=0). These options can be combined, as shown in the sketch after this list.
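
As a sketch, here’s how you might skip the existing header row, supply your own snake case column names, and use the student ID column as the index (the variable name and column names below are our own choices, not anything built in):

students_alt = pd.read_csv(
    "data/students.csv",
    skiprows=1,  # skip the original header row
    names=["student_id", "full_name", "favourite_food", "meal_plan", "age"],
    index_col="student_id",  # use the ID column as the index
)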

This is all you need to know to read ~75% of CSV files that you’ll encounter in practice. Tab-separated files can be read with the same function by passing sep="\t", while fixed-width files have their own dedicated function, pd.read_fwf().
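
A sketch of both, with hypothetical file names:

# tab-separated values: same function, different separator
df_tsv = pd.read_csv("data/example.tsv", sep="\t")

# fixed-width file: by default, pandas infers the column widths
df_fwf = pd.read_fwf("data/example.txt")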

8.2.2. First Steps#

Let’s take another look at the students data.

Once you read data in, the first step usually involves transforming it in some way to make it easier to work with in the rest of your analysis. For example, the column names in the students file we read in are formatted in non-standard ways.

You might consider renaming them one by one with .rename() or you might use a convenience function from another package to clean them and turn them all into snake case at once. We will make use of the skimpy package to do this. skimpy is a smaller package that isn’t available via conda; instead, install it by running pip install skimpy in the terminal.

From skimpy, we will use the clean_columns() function; this takes in a data frame and returns a data frame with variable names converted to snake case.

from skimpy import clean_columns

students = clean_columns(students)
students
   student_id         full_name      favourite_food            meal_plan   age
0           1    Sunil Huffmann  Strawberry yoghurt           Lunch only     4
1           2      Barclay Lynn        French fries           Lunch only     5
2           3     Jayendra Lyne                 NaN  Breakfast and lunch     7
3           4      Leon Rossini           Anchovies           Lunch only     8
4           5  Chidiegwu Dunkel               Pizza  Breakfast and lunch  five
5           6     Güvenç Attila           Ice cream           Lunch only     6

Another common task after reading in data is to consider variable types. In the favourite_food column, there are a bunch of food items and then the value NaN, which has been read in as a floating point number rather than a missing string. We can solve this by casting that column to explicitly be composed of strings:

students["favourite_food"] = students["favourite_food"].astype("string")
students
   student_id         full_name      favourite_food            meal_plan   age
0           1    Sunil Huffmann  Strawberry yoghurt           Lunch only     4
1           2      Barclay Lynn        French fries           Lunch only     5
2           3     Jayendra Lyne                <NA>  Breakfast and lunch     7
3           4      Leon Rossini           Anchovies           Lunch only     8
4           5  Chidiegwu Dunkel               Pizza  Breakfast and lunch  five
5           6     Güvenç Attila           Ice cream           Lunch only     6

Similarly, "age" has mixed data types: string and integer! Let’s replace the ‘five’ with the number 5.

students["age"] = students["age"].replace("five", 5)
students["age"]
0    4
1    5
2    7
3    8
4    5
5    6
Name: age, dtype: object

In a moment, we will turn this into a column of integers too.

Another example where the data type is wrong is meal_plan. This is a categorical variable with a known set of possible values, and pandas has a special data type for these:

students["meal_plan"] = students["meal_plan"].astype("category")
students["meal_plan"]
0             Lunch only
1             Lunch only
2    Breakfast and lunch
3             Lunch only
4    Breakfast and lunch
5             Lunch only
Name: meal_plan, dtype: category
Categories (2, object): ['Breakfast and lunch', 'Lunch only']

Note that the values in the meal_plan variable have stayed exactly the same, but the type of the variable has changed from object to category.

It is a bit tedious to have to go through columns one by one, setting types with single-line assignments. An alternative is to pass .astype() a dictionary that maps column names to types, as follows:

students = students.astype({"student_id": "int", "full_name": "string", "age": "int"})
students.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   student_id      6 non-null      int64   
 1   full_name       6 non-null      string  
 2   favourite_food  5 non-null      string  
 3   meal_plan       6 non-null      category
 4   age             6 non-null      int64   
dtypes: category(1), int64(2), string(2)
memory usage: 450.0 bytes
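
As an aside, if a numeric column contains stray strings that you haven’t spotted in advance, pd.to_numeric() with errors="coerce" is a blunter alternative to replacing them by hand: anything that can’t be parsed as a number becomes a missing value instead of raising an error. A minimal sketch:

# unparseable entries (like "five") would become NaN rather than raising an error
students["age"] = pd.to_numeric(students["age"], errors="coerce")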

8.2.3. Exercises#

  1. What function would you use to read a file where fields were separated with “|”?

8.3. Reading data from multiple files#

Sometimes your data is split across multiple files instead of being contained in a single file. For example, you might have sales data for multiple months, with each month’s data in a separate file: 01-sales.csv for January, 02-sales.csv for February, and 03-sales.csv for March.

With pd.read_csv() you can read these data in one-by-one and then stack them on top of each other in a single data frame using the pd.concat() function. This looks like:

list_of_dataframes = [
    pd.read_csv(x)
    for x in ["data/01-sales.csv", "data/02-sales.csv", "data/03-sales.csv"]
]
sales_files = pd.concat(list_of_dataframes)
sales_files
       month  year  brand  item  n
0    January  2019      1  1234  3
1    January  2019      1  8721  9
2    January  2019      1  1822  2
3    January  2019      2  3333  1
4    January  2019      2  2156  9
5    January  2019      2  3987  6
6    January  2019      2  3827  6
0   February  2019      1  1234  8
1   February  2019      1  8721  2
2   February  2019      1  1822  3
3   February  2019      2  3333  1
4   February  2019      2  2156  3
5   February  2019      2  3987  6
0      March  2019      1  1234  3
1      March  2019      1  3627  1
2      March  2019      1  8820  3
3      March  2019      2  7253  1
4      March  2019      2  8766  3
5      March  2019      2  8288  6

If you have many files you want to read in, it can get cumbersome to write out their names as a list. Instead, you can use the glob package (which is built into Python) to find the files for you by matching a pattern in their names. There may be other CSV files in the data/ directory, so here we specify "*-sales.csv" to pick up only the sales files; "*" plays the role of a wildcard, representing any sequence of characters.

import glob

list_of_csvs = glob.glob("data/*-sales.csv")
print("List of csvs is:")
print(list_of_csvs, "\n")
sales_files = pd.concat([pd.read_csv(x) for x in list_of_csvs])
sales_files
List of csvs is:
['data/03-sales.csv', 'data/02-sales.csv', 'data/01-sales.csv'] 
       month  year  brand  item  n
0      March  2019      1  1234  3
1      March  2019      1  3627  1
2      March  2019      1  8820  3
3      March  2019      2  7253  1
4      March  2019      2  8766  3
5      March  2019      2  8288  6
0   February  2019      1  1234  8
1   February  2019      1  8721  2
2   February  2019      1  1822  3
3   February  2019      2  3333  1
4   February  2019      2  2156  3
5   February  2019      2  3987  6
0    January  2019      1  1234  3
1    January  2019      1  8721  9
2    January  2019      1  1822  2
3    January  2019      2  3333  1
4    January  2019      2  2156  9
5    January  2019      2  3987  6
6    January  2019      2  3827  6
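
Note that glob makes no promises about the order in which it returns files (here the months came back in reverse). If the order matters, sort the list of names first; and passing ignore_index=True to pd.concat() builds a fresh 0-to-n index rather than repeating each file’s own row numbers. A sketch:

sales_files = pd.concat(
    [pd.read_csv(x) for x in sorted(glob.glob("data/*-sales.csv"))],
    ignore_index=True,  # number rows 0..n-1 instead of restarting at 0 for each file
)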

8.4. Writing to a file#

Just as the typical pattern for reading files is pd.read_FILETYPE(), where FILETYPE might be, for example, csv, all of the ways of writing pandas data frames to disk have the pattern DATAFRAME.to_FILETYPE(). So to write our sales data to a CSV file, the code is sales_files.to_csv(FILEPATH), where FILEPATH is the location plus the name of the file you want to write to.

Let’s see an example of writing data to file, using the students data whose data types we worked so hard to set:

students.to_csv("data/students-clean.csv")

Now let’s read it back in and check the info on data types:

pd.read_csv("data/students-clean.csv").info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Unnamed: 0      6 non-null      int64 
 1   student_id      6 non-null      int64 
 2   full_name       6 non-null      object
 3   favourite_food  5 non-null      object
 4   meal_plan       6 non-null      object
 5   age             6 non-null      int64 
dtypes: int64(3), object(3)
memory usage: 416.0+ bytes

Notice anything? We lost a lot of the nice data type work we did! While pandas guessed that some columns are integers, we lost the string and categorical variables. The reason for this is that plain-text files are not able to carry any contextual information (though pandas will guess some column data types). There’s also a new Unnamed: 0 column: by default, .to_csv() writes the data frame’s index out as an extra, unnamed first column.
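
To avoid that stray index column, either tell pandas not to write the index out, or tell it which column to treat as the index when reading the file back in:

# option 1: don't write the index out at all
students.to_csv("data/students-clean.csv", index=False)

# option 2: use the first column as the index when reading back in
pd.read_csv("data/students-clean.csv", index_col=0)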

If you want to save data in a file and have it remember the data types, you’ll need to use a different data format. For temporary storage, we recommend using the feather format as it is very fast and interoperable with other programming languages. Interoperability is a good reason to avoid language-specific file formats such as Stata’s .dta, R’s .rds, and Python’s .pickle.

Note that the feather format has an additional dependency in the form of a package called pyarrow. To install it, run pip install pyarrow in a terminal window.

Here’s an example of writing to a feather file:

students.to_feather("data/students-clean.feather")

Now let’s re-open that feather file and take a look at the info attached to it.

pd.read_feather("data/students-clean.feather").info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype   
---  ------          --------------  -----   
 0   student_id      6 non-null      int64   
 1   full_name       6 non-null      string  
 2   favourite_food  5 non-null      string  
 3   meal_plan       6 non-null      category
 4   age             6 non-null      int64   
dtypes: category(1), int64(2), string(2)
memory usage: 450.0 bytes

Saving to this format preserved our data type information.

8.4.1. Reading and Writing Other Data Formats#

The image at the start of this chapter gives you a sense of what other formats are available but you can find a comprehensive list over at the official pandas documentation on input and output.
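
For example, the parquet format (which, like feather, relies on the pyarrow package we installed above) follows exactly the same read/write pattern; a minimal sketch:

# write to and read from parquet, which also preserves data types
students.to_parquet("data/students-clean.parquet")
pd.read_parquet("data/students-clean.parquet").info()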