Unsupervised Machine Learning#

Introduction#

In this chapter, we’re going to look at unsupervised machine learning, focusing on clustering and dimensional reduction in particular, and leaning heavily on scikit-learn. Also in the category of unsupervised machine learning, but not covered here, are anomaly detection and outlier detection.

Before we get cracking on the chapter content, let’s do our normal set of imports.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

random_state = 42  # We'll use this throughout to make this page reproducible
prng = np.random.default_rng(random_state)

Clustering#

To cluster data is to use a set of high-dimensional features to create a grouping in which data points in the same cluster are more similar to each other than they are to data points in distinct clusters.

When might you wish to cluster? You can use clusters to get informative fixed effects based on other data. You can use clustering to create features for (supervised) machine learning. You can use clusters to help you understand different types of (groups of) data in your sample. Some more applied examples are: market segmentation, image segmentation, and anomaly detection.

As with much of the API, there are two ways to use the clustering algorithms found in the scikit-learn package: the first is a class that implements a fit method to learn clusters from training data; the second is a function that, given training data, returns an array of integer labels corresponding to the different clusters. Many of the class-based estimators also offer fit_predict and predict methods, so you can learn clusters and label the training data in one step or, for inductive methods, allocate new data to clusters that have already been learned.
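
As a rough sketch of those two patterns, here is K-means run both ways on some made-up data (the data and parameter choices below are purely illustrative):

from sklearn.cluster import KMeans, k_means

X_demo = prng.normal(size=(200, 2))  # made-up data, purely for illustration

# Class-based API: learn clusters, then allocate data to them
model_demo = KMeans(n_clusters=3, n_init=10, random_state=random_state).fit(X_demo)
labels_from_class = model_demo.predict(X_demo)

# Functional API: returns the centroids, integer labels, and inertia in one call
centroids_demo, labels_from_func, inertia_demo = k_means(
    X_demo, n_clusters=3, n_init=10, random_state=random_state
)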

An example: K-means#

To help illustrate exactly what’s going on in clustering, let’s look at a specific example using the K-means algorithm. K-means creates clusters by trying to separate samples into \(K\) groups of equal variance, minimising a criterion known as the inertia or within-cluster sum-of-squares:

\[ \sum_{i=1}^{N}\min_{\mu_j \in C}(||x_i - \mu_j||^2) \]

where a set of \(N\) samples is sorted into \(K\) disjoint clusters \(C\) each described by the mean point \(\mu_j\) of the samples in that cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from the input data, \(X\).
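
To make the criterion concrete, here is a minimal sketch that computes the inertia by hand for a toy assignment (the points and centroids below are made up for illustration):

# Six 2D points and two candidate centroids (made up for illustration)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
centroids_toy = np.array([[0.1, 0.1], [5.0, 5.0]])

# Squared distance from every point to every centroid, keep the nearest, then sum
sq_dists = ((X_toy[:, None, :] - centroids_toy[None, :, :]) ** 2).sum(axis=-1)
inertia_toy = sq_dists.min(axis=1).sum()
inertia_toy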

Let’s see an example of K-means in action. First, let’s generate some data

from sklearn.datasets import make_blobs

centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X_to_cluster, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7, random_state=random_state)
X_to_cluster
array([[ 1.56876808,  1.94936802],
       [ 1.47695363, -1.8075846 ],
       [ 1.25807132,  0.72466283],
       ...,
       [-0.7031532 , -0.97238707],
       [ 0.79388097,  0.01692888],
       [ 0.60502668,  1.12918591]])

Now we fit K-means

from sklearn.cluster import KMeans

k_means = KMeans(init="k-means++", n_clusters=3, n_init=10, random_state=random_state)
k_means.fit(X_to_cluster)
KMeans(n_clusters=3, n_init=10, random_state=42)

And plot the results:
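
A minimal sketch of plotting code that would produce a figure like the one below, using the labels_ and cluster_centers_ attributes of the fitted object (the colour and marker choices are just illustrative):

fig, ax = plt.subplots()
for k in range(n_clusters):
    members = k_means.labels_ == k
    ax.scatter(X_to_cluster[members, 0], X_to_cluster[members, 1], s=10, alpha=0.5, label=f"Cluster {k}")
ax.scatter(k_means.cluster_centers_[:, 0], k_means.cluster_centers_[:, 1], marker="x", s=100, color="k", label="Centroids")
ax.set_title("K-means clustering of the blob data")
ax.legend()
plt.show()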

_images/dd70321474b403fe57853d00e80f6006ad58637b4c898260aa1a8e6add541d05.svg

Clustering algorithms in scikit-learn#

There are too many clustering algorithms in scikit-learn alone to cover them in any real detail, so instead we will simply reproduce this ridiculously useful chart and table from the scikit-learn documentation, which show all of the package’s clustering algorithms and their strengths and weaknesses.

_images/996f732f9c382ba5d3e56bde342050ff4fbc7681addc3cda58a1f84db46ed495.svg

| Method name | Parameters | Scalability | Usecase | Geometry (metric used) |
| --- | --- | --- | --- | --- |
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters, inductive | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry, transductive | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, transductive | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances, transductive | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive | Distances between nearest points |
| HDBSCAN | minimum cluster membership, minimum point neighbors | Large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive, hierarchical, variable cluster density | Distances between nearest points |
| OPTICS | minimum cluster membership | Very large n_samples, large n_clusters | Non-flat geometry, uneven cluster sizes, variable cluster density, outlier removal, transductive | Distances between points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation, inductive | Mahalanobis distances to centers |
| BIRCH | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction, inductive | Euclidean distance between points |
| Bisecting K-Means | number of clusters | Very large n_samples, medium n_clusters | General-purpose, even cluster size, flat geometry, no empty clusters, inductive, hierarchical | Distances between points |
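
Because these algorithms share the same basic API, swapping one for another is usually a one- or two-line change. As a minimal sketch, here is DBSCAN run on the same blob data as above (the eps and min_samples values are illustrative and would normally need tuning):

from sklearn.cluster import DBSCAN

db = DBSCAN(eps=0.3, min_samples=10).fit(X_to_cluster)
db_labels = db.labels_  # a label of -1 marks points treated as noise
n_clusters_found = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print(f"DBSCAN found {n_clusters_found} clusters")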

Dimensional Reduction#

Dimensional reduction is the transformation of data from a high-dimensional space into a low-dimensional space such that the low-dimensional representation retains some meaningful properties of the original data. Reducing dimensions can be helpful because of the curse of dimensionality: data become sparse in higher dimensions, which can make it computationally infeasible, or even impossible, to run standard algorithms. Dimensional reduction is also helpful for summarising and visualising high-dimensional information in a continuous way (rather than the discrete way offered by clustering).

Factor analysis is one type of dimensional reduction that may already be familiar to you as an economist. This is available in scikit-learn, along with a range of other reduction algorithms.
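
If factor analysis is what you’re after, it follows the same pattern as the other scikit-learn decomposition tools; a minimal sketch on made-up data (the shapes and number of factors here are arbitrary) looks like this:

from sklearn.decomposition import FactorAnalysis

X_fake = prng.normal(size=(100, 5))  # made-up data: 100 observations, 5 features
fa = FactorAnalysis(n_components=2, random_state=random_state)
X_factors = fa.fit_transform(X_fake)  # shape (100, 2): the estimated factor scores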

In the rest of this chapter, we’ll showcase just two: principal component analysis and UMAP.

Principal Component Analysis#

The absolute bread and butter of dimensional reduction algorithms is principal component analysis, so we’ll start with that as a good example of how to use dimensional reduction algorithms in general with scikit-learn-like APIs. We will project the 4-dimensional Iris data down into two dimensions using PCA.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
X_iris, y_iris = iris.data, iris.target
target_names_iris = iris.target_names
pca_iris = PCA(n_components=2)
print(f"The dimensionality of the raw data is: {X_iris.shape}")
X_dim_red = pca_iris.fit_transform(X_iris)
print(f"explained variance ratio (1st two components): {(pca_iris.explained_variance_ratio_)}")
X_dim_red.shape
The dimensionality of the raw data is: (150, 4)
explained variance ratio (1st two components): [0.92461872 0.05306648]
(150, 2)
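
It can also be instructive to look at how the original features load onto the retained components, and at how much variance the components capture cumulatively; both come straight from attributes of the fitted PCA object:

# Loadings of each original feature on the two retained components
loadings = pd.DataFrame(
    pca_iris.components_, columns=iris.feature_names, index=["PC1", "PC2"]
)
print(loadings)
print(f"Cumulative explained variance: {pca_iris.explained_variance_ratio_.cumsum()}")

Now let’s plot the data in the space of the first two components, coloured by species: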
colors = ["navy", "turquoise", "darkorange"]
lw = 2

fig, ax = plt.subplots()
for color, i, target_name in zip(colors, [0, 1, 2], target_names_iris):
    ax.scatter(
        X_dim_red[y_iris == i, 0], X_dim_red[y_iris == i, 1], color=color, alpha=0.8, lw=lw, label=target_name
    )
ax.legend(loc="best", shadow=False, scatterpoints=1)
ax.set_title("PCA of IRIS dataset")
plt.show()
_images/c03677235266c4dd554b8071e5a8a081bc699083d9127c2c39ae857c43e23ed9.svg

You can see here that we have projected the data onto its first two principal components (axes) and that the three species separate reasonably well.

UMAP, or Uniform Manifold Approximation and Projection for Dimension Reduction#

UMAP is a relatively recent algorithm [McInnes, Healy, and Melville, 2018] that is somewhat similar to t-SNE (t-distributed Stochastic Neighbor Embedding). Both have issues with reproducing structure and should be used with care, but both can be quite impressive with the right parameters. t-SNE is included in scikit-learn, but for UMAP, you’ll need to install the umap-learn package.
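
For comparison, here is a minimal t-SNE sketch on the iris data from the previous section (results are quite sensitive to the perplexity parameter, so treat this as a starting point rather than a recipe):

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=random_state)
X_iris_tsne = tsne.fit_transform(X_iris)  # shape (150, 2)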

We will use the penguins dataset to demo UMAP (following the documentation).

penguins = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
penguins["sex"] = penguins["sex"].str.title()
penguins.head()
|   | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |

As ever, it does help to clean up the data. We will drop NAs and convert each feature into z-scores (number of standard deviations from the mean) for comparability. We will also only use the numeric columns.

import umap
from sklearn.preprocessing import StandardScaler

reducer = umap.UMAP()
penguins_data = (
    penguins
    .dropna(how="any")
    .loc[:,
    [
        "bill_length_mm",
        "bill_depth_mm",
        "flipper_length_mm",
        "body_mass_g",
    ]]
)
scaled_penguin_data = StandardScaler().fit_transform(penguins_data)

Let’s now create the embedding using UMAP:

embedding = reducer.fit_transform(scaled_penguin_data)
embedding.shape
(333, 2)

And plot the results

species = list(penguins["species"].unique())
species_map = {i: x for i, x in zip(range(len(species)), species)}

fig, ax = plt.subplots()
for i, spec_name in species_map.items():
    this_cut = penguins.dropna(how="any")["species"].astype("category").cat.codes == i
    ax.scatter(
        embedding[this_cut, 0],
        embedding[this_cut, 1],
        c=plt.rcParams['axes.prop_cycle'].by_key()['color'][i],
        label=spec_name,
    )
ax.set_title("UMAP projection of the Penguins dataset")
ax.legend()
plt.show()
_images/552db8f589b53d5cb1b09e496c590359b79816bd5b512b47936f3f1894de35a7.svg

This does a useful job of capturing the structure of the data, which you can check for yourself by looking at a pairplot of all of the numerical fields.
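
If you want to run that check, a minimal sketch using seaborn (which you may need to install separately) would be:

import seaborn as sns

# Pairwise scatter plots of the numeric penguin columns, coloured by species
sns.pairplot(penguins.dropna(how="any"), hue="species")
plt.show()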

While the benefits of dimensional reduction are small for such low-dimensional datasets, they become much larger as soon as you are dealing with six or more dimensions, and perhaps with thousands.