A huge amount of data lives in databases, so it’s essential that you know how to access them. Sometimes you can ask someone to download a snapshot into a .csv for you, but this gets painful quickly: every time you need to make a change you’ll have to communicate with another human. You want to be able to reach into the database directly to get the data you need, when you need it.

In this chapter, you’ll first learn the basics of Python packages that can interact with SQL databses: how to use them to connect to a database and then retrieve data with a SQL[1] query. SQL, short for structured query language, is the lingua franca of databases, and is an important language for all data scientists to learn. You won’t become a SQL master by the end of the chapter, but you will be able to identify the most important components and understand what they do.


You will need the pandas, SQLModel, and ibis packages for this chapter. You probably already have pandas installed; to install SQLModel and ibis respectively run pip install sqlmodel and pip install ibis-framework on your computer’s command line. First, let’s bring in some general packages and turn off verbose warnings.

import warnings
from pathlib import Path


Database Basics#

At the simplest level, you can think about a database as a collection of data frames, called tables in database terminology. Like a pandas dataframe, a database table is a collection of named columns, where every value in the column is the same type. There are three high level differences between data frames and database tables:

  • Database tables are stored on disk (ie on file) and can be arbitrarily large. Data frames are stored in memory, and are fundamentally limited (although that limit is still big enough for many problems). You can think about the difference between on disk and in memory as being like the difference between long-term and short-term memory (and you have much more limited capacity in the latter).

  • Database tables almost always have indexes. Much like the index of a book, a database index makes it possible to quickly find rows of interest without having to look at every single row.

  • Most classical databases are optimised for rapidly collecting data, not analysing existing data. These databases are called row-oriented because the data is stored row-by-row, rather than column-by-column. More recently, there’s been much development of column-oriented databases that make analysing the existing data much faster.

Databases are run by database management systems (DBMS’s for short), which come in three basic forms:

  • Client-server DBMS’s run on a powerful central server, which you connect from your computer (the client). They are great for sharing data with multiple people in an organisation. Popular client-server DBMS’s include PostgreSQL, MariaDB, SQL Server, and Oracle.

  • Cloud DBMS’s, like Snowflake, Amazon’s RedShift, and Google’s BigQuery, are similar to client server DBMS’s, but they run in the cloud. This means that they can easily handle extremely large datasets and can automatically provide more compute resources as needed.

  • In-process DBMS’s, like SQLite or duckdb, run entirely on your computer. They’re great for working with large datasets where you’re the primary user.

Connecting to Databases#

To connect to a database from Python, there are a number of options, but it does depend on what kind of database you are connecting to. The steps (eg connect to the database, upload data, run SQL series) are similar though, so we won’t lose much by just choosing one example—a SQLite database. SQLite is a small, fast, self-contained, SQL database engine and the most used database engine in the world. A lot of the data on your computer and mobile phone will in fact be in SQLite databases. (Simon Willison has some great tools to help export some of the data that you’ve created!)

We will always need two steps though, regardless of what kind of SQL database we are connecting to:

  • You’ll always use a database interface that provides a connection to the database, for example Python’s built-in sqlite package

  • You’ll also use a package that pushes and/or pulls data to/from the database, for example pandas

The precise details of the connection varies a lot from DBMS to DBMS so unfortunately we can’t cover all the details here. The initial setup will often take a little fiddling (and maybe some research) to get right, but you’ll generally only need to do it once. We’ll do the best we can to cover some basics here.

Setting up a client-server or cloud DBMS would be a pain for this book, so we’ll instead use an in-computer DBMS.

Working with Databases Directly#

Let’s connect to a small SQLite database called the Chinook database, which contains information about the artists, songs, and albums from a music shop, as well as information on the shop’s employees, customers, and the customers purchases. This information is contained in eleven tables. The figure below shows the data schema:

Data schema of Chinook database

You can download the file, Chinook.sqlite, from the github repository of this book here; on your computer, you’ll need to save it in a subdirectory (relative to where your code is being executed) called ‘data’ to work through the exercises below.

For starters, let’s use Python’s built in sqlite3 engine to connect to the database and execute a very simple SQL query to select the first ten entries from the ‘Artists’ table:

import sqlite3

con = sqlite3.connect(Path("data/Chinook.sqlite"))

cursor = con.execute("SELECT * FROM Artist LIMIT 10;")
rows = cursor.fetchall()
[(1, 'AC/DC'),
 (2, 'Accept'),
 (3, 'Aerosmith'),
 (4, 'Alanis Morissette'),
 (5, 'Alice In Chains'),
 (6, 'Antônio Carlos Jobim'),
 (7, 'Apocalyptica'),
 (8, 'Audioslave'),
 (9, 'BackBeat'),
 (10, 'Billy Cobham')]

Note that the output here is in the form a Python object called a tuple. If we wanted to put this into a pandas dataframe, we can just pass it straight in:

import pandas as pd

0 1
0 1 AC/DC
1 2 Accept
2 3 Aerosmith
3 4 Alanis Morissette
4 5 Alice In Chains
5 6 Antônio Carlos Jobim
6 7 Apocalyptica
7 8 Audioslave
8 9 BackBeat
9 10 Billy Cobham

Another useful hint is that if you’re not sure what the column names are, you can obtain them from:

[i[0] for i in cursor.description]
['ArtistId', 'Name']

Creating a Database#

Often, you’ll want to create a SQL database to later (efficiently) access cuts of data. Let’s create a test database directly using the sqlite package. This process involves a CREATE TABLE statement, then the name of the table followed by the names of the columns and their data types.

create_query = """CREATE TABLE test (country VARCHAR(20), gdp REAL, health INTEGER);"""
con_new = sqlite3.connect("data/test_database.sqlite")

If this runs, then you’ve created a test database! (You can check the data directory on your own computer to see if it worked, but you’ll get an error if the data directory doesn’t already exist.)

Let’s now fill the database with some values:

test_data = [("US", 1, 3), ("UK", 0.6, 2), ("France", 0.8, 1)]

con_new.executemany("INSERT INTO test VALUES(?, ?, ?)", test_data)

Finally, let’s check if this has worked:

con_new.execute("SELECT * FROM test").fetchall()
[('US', 1.0, 3), ('UK', 0.6, 2), ('France', 0.8, 1)]

Whahey, this worked!

Basic SQL Queries#

A full description of SQL queries is outside the scope of this book, but we’ll try and arm you with the basics. The top-level components of SQL are called statements. Common statements include CREATE for defining new tables, INSERT for adding data, and SELECT for retrieving data. We will on focus on SELECT statements, also called queries, because they are almost exclusively what you’ll use as a data scientist.

A query is made up of clauses. There are six important clauses: SELECT, FROM, WHERE, ORDER BY, GROUP BY, and LIMIT. Every query must have the SELECT[2] and FROM[3] clauses and the simplest query is SELECT * FROM table, which selects all columns from a specified table called “table”. WHERE and ORDER BY control which rows are included and how they are ordered. GROUP BY converts the query to a summary, causing aggregation to happen. LIMIT limits how many rows are returned.

Important: in SQL, the order matters. You must always write the clauses in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY. Confusingly, this order doesn’t match how the clauses actually evaluated which is first FROM, then WHERE, GROUP BY, SELECT, and ORDER BY.

Let’s look at a few of these in action. To get a few entries, it’s


To be specific about columns, it’s

SELECT name FROM Artist LIMIT 10;

To reverse order, use “ORDER BY” and then “DESC”. For example, this query gives us the ten longest tracks

SELECT name, milliseconds FROM track ORDER BY milliseconds DESC LIMIT 10;

Another headline feature is being able to filter. For example, we could ask for the first ten tracks that are at least 3 minutes long. Let’s run this one for real:

sql_query = "SELECT name, milliseconds FROM track WHERE milliseconds > 1e3*3*60 ORDER BY milliseconds ASC LIMIT 10;"
cursor = con.execute(sql_query)
rows = cursor.fetchall()
[('Bodies', 180035),
 ('Vivo Isolado Do Mundo', 180035),
 ('Elvis Ate America', 180166),
 ('Remote Control', 180297),
 ('Promises', 180401),
 ('Emergency', 180427),
 ('À Vontade (Live Mix)', 180636),
 ('Hyperconectividade', 180636),
 ('On Fire', 180636),
 ('Fascinação', 180793)]

Let’s try a groupby. This one will groupby album to find the average track length in minutes for (the first five) albums:

sql_groupby = "SELECT albumid, AVG(milliseconds)/1e3/60 FROM track GROUP BY albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;"
cursor = con.execute(sql_groupby)
rows = cursor.fetchall()
[(340, 0.863),
 (345, 1.11065),
 (318, 1.6882166666666667),
 (314, 1.69135),
 (328, 1.8377666666666668)]


If you’re familiar with joins in pandas, SQL joins are very similar. Let’s see if we can join the ‘album’ and ‘track’ tables to find the name of the albums in the above query.

Note that as soon as we have the same column names in more than one table, we need to specify the table we are referring to when we use that column name. There are different options for joins (eg INNER, LEFT) that you can find out more about here.

sql_join = "SELECT track.albumid, AVG(milliseconds)/1e3/60, album.title FROM track INNER JOIN album ON (track.albumid = album.albumid) GROUP BY album.albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;"
cursor = con.execute(sql_join)
rows = cursor.fetchall()
[(340, 0.863, "Liszt - 12 Études D'Execution Transcendante"),
 (345, 1.11065, "Monteverdi: L'Orfeo"),
 (318, 1.6882166666666667, 'SCRIABIN: Vers la flamme'),
 (314, 1.69135, 'English Renaissance'),
 (328, 1.8377666666666668, 'Charpentier: Divertissements, Airs & Concerts')]


We already snuck in one function: AVG. Others that you can apply to columns include SUM, MIN, MAX, FIRST, LAST and COUNT (the data scientist’s friend). You can find out more about SQL functions here.

Exploring SQL Databases with Datasette#

SQL can be pretty intimidating at first, but there’s a great tool for getting to know it better called Datasette. Datasette is a tool for exploring (and creating and sharing—but that’s covered in Sharing Data) databases. It’s author says that “Datasette is aimed at data journalists, museum curators, archivists, local governments, scientists, researchers and anyone else who has data that they wish to share with the world. It is part of a wider ecosystem of 40 tools and 99 plugins dedicated to making working with structured data as productive as possible.”

In this context, Datasette is useful in two ways:

  • to explore a database interactively, through pointing and clicking

  • to see how a SQL SELECT statement is built up of filtering operations or vice versa

Essentially, when Datasette is running, you’ll see a landing web page (yes, even if you’re running it on your own computer) that presents all of the tables in your dataset. You can click through to tables to explore them, re-order them, filter them, and more. You can also see the equivalent SQL query for any cut you make, or run a SQL query in the box to retrieve any cut you like. But note that Datasette only supports SELECT statements.

A useful extra feature is that you can download the results of your SQL query as a CSV or JSON file from the web page run by Datasette once you have filtered to what you want.

There are a few different ways to use Datasette to explore data:

  • you can run it on your own computer

  • you can try out an online version (which has been hosted already on the cloud), for example this database of power stations

  • you can use the online coding service glitch to run it. See an example here.

Datasette comes as a Python package that you can install on the command line by running pip install datasette. Once you have it installed in a Python environment, run

datasette path/to/database.db -o

and you should find that your default browser opens straight away on a page showing the tables in the database (the page should have an address beginning http://localhost:8001/).

Finally, there’s some training for SQL that uses Datasette available here.


Using the Chinook database on your own computer, run the SQL query with the join from before, ie:

  AVG(milliseconds) / 1e3 / 60
  AVG(milliseconds) ASC

SQL with pandas#

pandas is well-equipped for working with SQL. We can simply push the query we just created straight through using its read_sql() function—but bear in mind we need to pass in the connection we created to the database too:

pd.read_sql(sql_join, con)
AlbumId AVG(milliseconds)/1e3/60 Title
0 340 0.863000 Liszt - 12 Études D'Execution Transcendante
1 345 1.110650 Monteverdi: L'Orfeo
2 318 1.688217 SCRIABIN: Vers la flamme
3 314 1.691350 English Renaissance
4 328 1.837767 Charpentier: Divertissements, Airs & Concerts

One nice feature of this is that the column names in SQL get passed straight to the column names in our dataframe.

Now, when you’re writing Python in Visual Studio Code (at least with the Python extensions installed), you get a lot of high quality syntax and auto-completion support. Extensions to the Python language also allow you to take a great deal of care over the types of variables that you are dealing with. Wouldn’t it be nice to have all of that with SQL too (even when accessing it via Python)? The next two packages we’ll look at provide that. Both make working with SQL databases from Python a lot easier and more productive.

SQL with ibis#

It’s not exactly satisfactory to have to write out your SQL queries in text. What if we could create commands directly from pandas commands? You can’t quite do that, but there’s a package that gets you pretty close and it’s called ibis. ibis is particularly useful when you are reading from a database and want to query it just like you would a pandas dataframe.

Ibis can connect to local databases (eg a SQLite database), server-based databases (eg Postgres), or cloud-based databased (eg Google’s BigQuery). The syntax to make a connection is, for example, ibis.bigquery.connect().

Let’s see ibis in action by reproducing the commands we’ve already seen on the Chinook database, a locally hosted database. First we import it and make a connection to the database.

import ibis

ibis.options.interactive = True
connection = ibis.sqlite.connect(Path("data/Chinook.sqlite"))
track = connection.table("track")
┃ TrackId  Name                                     AlbumId  MediaTypeId  GenreId  Composer                                                                Milliseconds  Bytes     UnitPrice       ┃
│ !int32!stringint32!int32int32string!int32int32!decimal(10, 2) │
│       1For Those About To Rock (We Salute You)111Angus Young, Malcolm Young, Brian Johnson                             343719111703340.99 │
│       2Balls to the Wall                      221NULL34256255104240.99 │
│       3Fast As a Shark                        321F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman                   23061939909940.99 │
│       4Restless and Wild                      321F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman25205143317790.99 │
│       5Princess of the Dawn                   321Deaffy & R.A. Smith-Diesel                                            37541862905210.99 │

As an aside, you can list the available tables using


Okay, now let’s reproduce the following query: “SELECT albumid, AVG(milliseconds)/1e3/60 FROM track GROUP BY albumid ORDER BY AVG(milliseconds) ASC LIMIT 5;”. We’ll use a groupby, a mutate (which you can think of like pandas’ assign statement), a sort, and then limit() to only show the first five entries.

┃ TrackId  Name                                                      AlbumId  MediaTypeId  GenreId  Composer            Milliseconds  Bytes    UnitPrice        mean_mins_track ┃
│ !int32!stringint32!int32int32string!int32int32!decimal(10, 2)float64         │
│    3496Étude 1, In C Major - Preludio (Presto) - Liszt         340424NULL5178022296170.990.863000 │
│    3501L'orfeo, Act 3, Sinfonia (Orchestra)                    345224Claudio Monteverdi6663911890620.991.110650 │
│    3452SCRIABIN: Prelude in B Major, Op. 11, No. 11            318424NULL10129338195350.991.688217 │
│    3448Lamentations of Jeremiah, First Set \ Incipit Lamentatio314224Thomas Tallis     6919412080800.991.691350 │
│    3492Sing Joyfully                                           314224William Byrd      13376822564840.991.691350 │

What about joins? Of course, you can do these too. As an example, let’s join the genre and track tables on the shared variable, “GenreId”.

genre = connection.table("genre")
genre_and_track = track.inner_join(
    genre, predicates=track["GenreId"] == genre["GenreId"]

┃ TrackId  Name                                     AlbumId  MediaTypeId  GenreId  Composer                                                                Milliseconds  Bytes     UnitPrice       Name_right ┃
│ int32stringint32int32int32stringint32int32decimal(10, 2)string     │
│       1For Those About To Rock (We Salute You)111Angus Young, Malcolm Young, Brian Johnson                             343719111703340.99Rock       │
│       2Balls to the Wall                      221NULL34256255104240.99Rock       │
│       3Fast As a Shark                        321F. Baltes, S. Kaufman, U. Dirkscneider & W. Hoffman                   23061939909940.99Rock       │
│       4Restless and Wild                      321F. Baltes, R.A. Smith-Diesel, S. Kaufman, U. Dirkscneider & W. Hoffman25205143317790.99Rock       │
│       5Princess of the Dawn                   321Deaffy & R.A. Smith-Diesel                                            37541862905210.99Rock       │
│       6Put The Finger On You                  111Angus Young, Malcolm Young, Brian Johnson                             20566267134510.99Rock       │
│       7Let's Get It Up                        111Angus Young, Malcolm Young, Brian Johnson                             23392676365610.99Rock       │
│       8Inject The Venom                       111Angus Young, Malcolm Young, Brian Johnson                             21083468528600.99Rock       │
│       9Snowballed                             111Angus Young, Malcolm Young, Brian Johnson                             20310265994240.99Rock       │
│      10Evil Walks                             111Angus Young, Malcolm Young, Brian Johnson                             26349786112450.99Rock       │
│                 │

Ibis does a lot more than is shown here, including lazy evaluation, geospatial work, typing, and of course all of the common SQL commands that you’d expect. For more, check out the documentation.

SQL with SQLModel#

Ibis isn’t the only smart option for working with SQL from Python; SQLModel is another option. Use sqlmodel when you want to create production-grade SQL databases or want to query them using a slightly different flavour of syntax to ibis with extremely rich support.

So a couple of key strengths of sqlmodel include fantastic auto-complete support and being very strict on datatypes (which will save time in the long run, especially if you are creating databases).

First, make sure you have the package installed by running pip install sqlmodel on the command line.

Creating a SQL Database with SQLModel#

Imagine you have a SQL table called “hero” and you want it to have this data:







Dive Wilson




Pedro Parqueador




Tommy Sharp



Ms Amazing

Barjabeen Bhabra


Now, to create the SQL table, we write a structure called a class. Classes look a bit like functions but, in the body of this class, we list the name of the field followed by the Python code for the type of data represented:

from typing import Optional

from sqlmodel import Field, Session, SQLModel, create_engine

class Hero(SQLModel, table=True):
    id: Optional[int] = Field(default=None, primary_key=True)
    name: str
    secret_name: str
    age: Optional[int] = None

We then create the rows of data (this could be done programmatically but we’ll write them out here). Note that there is autocompletion of the fields in the below!

hero_1 = Hero(name="Deadpond", secret_name="Dive Wilson")
hero_2 = Hero(name="Spider-Boy", secret_name="Pedro Parqueador")
hero_3 = Hero(name="Rusty-Man", secret_name="Tommy Sharp", age=48)
hero_4 = Hero(name="Ms Amazing", secret_name="Barjabeen Bhabra", age=17)

Now we need to create the actual database and connect to it. The syntax is the name of the type of database, here sqlite, and then :/// followed by the relative path to where you want to store the database.

engine = create_engine("sqlite:///data/hero.db")


with Session(engine) as session:

We can now get a read-out from the database we created using the direct approach (ie using the sqlite3 package):

con = sqlite3.connect(Path("data/hero.db"))

con.execute("SELECT * FROM hero;").fetchall()
[(1, 'Deadpond', 'Dive Wilson', None),
 (2, 'Spider-Boy', 'Pedro Parqueador', None),
 (3, 'Rusty-Man', 'Tommy Sharp', 48),
 (4, 'Ms Amazing', 'Barjabeen Bhabra', 17)]

Reading from SQL using SQLModel#

SQLModel translates all of the usual SQL statements into functions. “SELECT” is no different. So let’s see an example of reading the SQL database directly from SQLModel:

from sqlmodel import select

with Session(engine) as session:
    statement = select(Hero)  # Equivalent to SELECT * from hero
    results = session.exec(statement)  # Execute the command
    for hero in results:
age=None name='Deadpond' id=1 secret_name='Dive Wilson'
age=None name='Spider-Boy' id=2 secret_name='Pedro Parqueador'
age=48 name='Rusty-Man' id=3 secret_name='Tommy Sharp'
age=17 name='Ms Amazing' id=4 secret_name='Barjabeen Bhabra'

We can also get all of it back via results.all(). The resulting object is a list of “hero” objects whose properties can be accessed via heroes[0].name and so on.

with Session(engine) as session:
    statement = select(Hero)
    results = session.exec(statement)
    heroes = results.all()

print(*heroes, sep="\n")
age=None name='Deadpond' id=1 secret_name='Dive Wilson'
age=None name='Spider-Boy' id=2 secret_name='Pedro Parqueador'
age=48 name='Rusty-Man' id=3 secret_name='Tommy Sharp'
age=17 name='Ms Amazing' id=4 secret_name='Barjabeen Bhabra'

You can find more info in the documentation for SQLModel, but most of the SQL query commands you’d expect are available. Let’s run an example from earlier:

with Session(engine) as session:
    statement = select(Hero).where(Hero.age < 100).limit(2).order_by(Hero.age)
    results = session.exec(statement)
    heroes = results.all()

print(*heroes, sep="\n")
age=17 name='Ms Amazing' id=4 secret_name='Barjabeen Bhabra'
age=48 name='Rusty-Man' id=3 secret_name='Tommy Sharp'

This has barely scraped the surface of what SQLModel can do: joins, updating data, deleting rows, and more are all possible too.

Where to Learn More About SQL#

If you’ve finished this chapter and would like to learn more about SQL, here are some further resources:

  • SQL for Data Scientists by Renée M. P. Teate is an introduction to SQL designed specifically for the needs of data scientists, and includes examples of the sort of highly interconnected data you’re likely to encounter in real organisations.

  • Practical SQL by Anthony DeBarros is written from the perspective of a data journalist (a data scientist specialised in telling compelling stories) and goes into more detail about getting your data into a database and running your own DBMS.

  • Check out one of the many sql cheatsheets that have all of the basics written in one place

  • Try out sql in your browser at SQLime

  • Learn SQL by solving a murder mystery