Unsupervised Machine Learning#

Introduction#

In this chapter, we’re going to look at unsupervised machine learning. We’ll be looking at clustering and dimensional reduction in particular. We’ll be leaning heavily on scikit-learn. Also in the category of unsupervised machine learning, but not covered here, are anomaly detection and outlier detection.

Before we get cracking on the chapter content, let’s do our normal set of imports.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

random_state = 42  # We'll use this throughout to make this page reproducible
prng = np.random.default_rng(random_state)

Clustering#

To cluster data is to use a set of high-dimensional features to create a grouping wherein data points in the same cluster are more similar to each other than data points in distinct clusters.

When might you wish to cluster? You can use clusters to get informative fixed effects based on other data. You can use clustering to create features for (supervised) machine learning. You can use clusters to help you understand different types of (groups of) data in your sample. Some more applied examples are: market segmentation, image segmentation, and anomaly detection.

As with much of the API, there are two ways to use the clustering algorithms found in the scikit-learn package: the first implements the fit method to learn clusters on training data; the second is a function, that, given training data, returns an array of integer labels corresponding to the different clusters. As ever, there are fit, fit_tranform, and transform methods so that you can create clusters, create and allocate data to clusters, or just allocate data to existing clusters, respectively.

An example: K-means#

To help illustrate exactly what’s going on in clustering, let’s look at a specific example using the K-means algorithm. K-means creates clusters by trying to separate samples in \(K\) groups of equal variance, minimising a criterion known as the inertia or within-cluster sum-of-squares:

\[ \sum_{i=0}^{n}\min_{\mu_j \in C}(||x_i - \mu_j||^2) \]

where a set of \(N\) samples is sorted into \(K\) disjoint clusters \(C\) each described by the mean point \(\mu_j\) of the samples in that cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from the input data, \(X\).

Let’s see an example of K-means in action. First, let’s generate some data

from sklearn.datasets import make_blobs

batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X_to_cluster, labels_true = make_blobs(
    n_samples=3000, centers=centers, cluster_std=0.7, random_state=random_state
)
X_to_cluster
array([[ 1.56876808,  1.94936802],
       [ 1.47695363, -1.8075846 ],
       [ 1.25807132,  0.72466283],
       ...,
       [-0.7031532 , -0.97238707],
       [ 0.79388097,  0.01692888],
       [ 0.60502668,  1.12918591]])

Now we fit K-means

from sklearn.cluster import KMeans

k_means = KMeans(init="k-means++", n_clusters=3, n_init=10)
k_means.fit(X_to_cluster)
KMeans(n_clusters=3, n_init=10)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

And plot the results:

_images/07733c068c422e49b0ebd1b898e95cc226094d747e41396adadd4cc5743cfbf6.svg

Clustering algorithms in scikit-learn#

There are too many clustering algorithms just in scikit-learn to cover them in any real detail so instead we will simply reproduce this ridiculously useful chart and table that show all of the package’s clustering algorithms and their strengths and weaknesses.

_images/7df69d34220ea9507fffbd017d6378d17f72283c095361121480a04d7dff6bb8.svg

Method name

Parameters

Scalability

Usecase

Geometry (metric used)

K-Means

number of clusters

Very large n_samples, medium n_clusters with MiniBatch code

General-purpose, even cluster size, flat geometry, not too many clusters, inductive

Distances between points

Affinity propagation

damping, sample preference

Not scalable with n_samples

Many clusters, uneven cluster size, non-flat geometry, inductive

Graph distance (e.g. nearest-neighbor graph)

Mean-shift

bandwidth

Not scalable with n_samples

Many clusters, uneven cluster size, non-flat geometry, inductive

Distances between points

Spectral clustering

number of clusters

Medium n_samples, small n_clusters

Few clusters, even cluster size, non-flat geometry, transductive

Graph distance (e.g. nearest-neighbor graph)

Ward hierarchical clustering

number of clusters or distance threshold

Large n_samples and n_clusters

Many clusters, possibly connectivity constraints, transductive

Distances between points

Agglomerative clustering

number of clusters or distance threshold, linkage type, distance

Large n_samples and n_clusters

Many clusters, possibly connectivity constraints, non Euclidean distances, transductive

Any pairwise distance

DBSCAN

neighborhood size

Very large n_samples, medium n_clusters

Non-flat geometry, uneven cluster sizes, outlier removal, transductive

Distances between nearest points

HDBSCAN

minimum cluster membership, minimum point neighbors

large n_samples, medium n_clusters

Non-flat geometry, uneven cluster sizes, outlier removal, transductive, hierarchical, variable cluster density

Distances between nearest points

OPTICS

minimum cluster membership

Very large n_samples, large n_clusters

Non-flat geometry, uneven cluster sizes, variable cluster density, outlier removal, transductive

Distances between points

Gaussian mixtures

many

Not scalable

Flat geometry, good for density estimation, inductive

Mahalanobis distances to centers

BIRCH

branching factor, threshold, optional global clusterer.

Large n_clusters and n_samples

Large dataset, outlier removal, data reduction, inductive

Euclidean distance between points

Bisecting K-Means

number of clusters

Very large n_samples, medium n_clusters

General-purpose, even cluster size, flat geometry, no empty clusters, inductive, hierarchical

Distances between points

Dimensional Reduction#

Dimensional reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data. Reducing dimensions can be helpful because of the curse of dimensionality and the sparsity of data in higher dimensions making it either computationally infeasible or perhaps impossible to run standard algorithms. It can also be helpful to summarise and visualise high-dimensional information in a continuous way (rather than the discrete way offered by clustering).

Factor analysis is one type of dimensional reduction that may already be familiar to you as an economist. This is available in scikit-learn, along with a range of other reduction algorithms.

In the rest of this chapter, we’ll showcase just two: principal component analysis and UMAP.

Principal Component Analysis#

The absolute bread and butter of dimensional reduction algorithms is principal component analysis, so we’ll start with that as a good example of how to use dimensional reduction algorithms in general with scikit-learn-like APIs. We will project the 4-dimensional Iris data down into two dimensions using PCA.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X_iris, y_iris = iris.data, iris.target
target_names_iris = iris.feature_names
pca_iris = PCA(n_components=2)
print(f"The dimensionality of the raw data is: {X_iris.shape}")
X_dim_red = pca_iris.fit_transform(X_iris)
print(
    f"explained variance ratio (1st two components): {(pca_iris.explained_variance_ratio_)}"
)
X_dim_red.shape
The dimensionality of the raw data is: (150, 4)
explained variance ratio (1st two components): [0.92461872 0.05306648]
(150, 2)
colors = ["navy", "turquoise", "darkorange"]
lw = 2

fig, ax = plt.subplots()
for color, i, target_name in zip(colors, [0, 1, 2], target_names_iris):
    ax.scatter(
        X_dim_red[y_iris == i, 0],
        X_dim_red[y_iris == i, 1],
        color=color,
        alpha=0.8,
        lw=lw,
        label=target_name,
    )
ax.legend(loc="best", shadow=False, scatterpoints=1)
ax.set_title("PCA of IRIS dataset")
plt.show()
_images/27f2d884d5c0e9b5992163404df66f06909fa7e210ca670dcbf2f106b83e2ea1.svg

You can see here that we have projected out onto two principal dimensions (axes) and that there is some good separation of points.

UMAP, or Uniform Manifold Approximation and Projection for Dimension Reduction#

UMAP is a relatively recent algorithm [McInnes, Healy, and Melville, 2018] that is somewhat similar to t-SNE (t-distributed Stochastic Neighbor Embedding). Both have issues with reproducing structure and should be used with care but both can be quite impresssive with the right parameters. t-SNE is included in scikit-learn, but for UMAP, you’ll need to install the umap-learn package.

We will use the penguins dataset to demo UMAP (following the documentation).

penguins = pd.read_csv(
    "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv"
)
penguins["sex"] = penguins["sex"].str.title()
penguins.head()
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 Male
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 Female
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 Female
3 Adelie Torgersen NaN NaN NaN NaN NaN
4 Adelie Torgersen 36.7 19.3 193.0 3450.0 Female

As ever, it does help to clean up the data. We will drop NAs and convert each feature into z-scores (number of standard deviations from the mean) for comparability. We will also only use the numeric columns.

import umap

reducer = umap.UMAP()
penguins_data = penguins.dropna(how="any").loc[
    :,
    [
        "bill_length_mm",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
    ],
]
scaled_penguin_data = StandardScaler().fit_transform(penguins_data)

Let’s now create the embedding using UMAP:

embedding = reducer.fit_transform(scaled_penguin_data)
embedding.shape
(333, 2)

And plot the results

species = list(penguins["species"].unique())
species_map = {i: x for i, x in zip(range(len(species)), species)}

fig, ax = plt.subplots()
for i, spec_name in species_map.items():
    this_cut = penguins.dropna(how="any")["species"].astype("category").cat.codes == i
    ax.scatter(
        embedding[this_cut, 0],
        embedding[this_cut, 1],
        c=plt.rcParams["axes.prop_cycle"].by_key()["color"][i],
        label=spec_name,
    )
ax.set_title("UMAP projection of the Penguins dataset")
ax.legend()
plt.show()
_images/c6d286a337747ea1c7ff6cd08cfb8b702e1bdae41bf010e8263fee123e4f1bef.svg

This does a useful job of capturing the structure of the data, which you can check for yourself by looking at a pairplot of all of the numerical fields.

While the benefits of dimensional reduction are small for such low dimensional datasets, as soon as you are dealing with 6 or more dimensions, perhaps as many as thousands, the benefits become larger.