Unsupervised Machine Learning#
Introduction#
In this chapter, we’re going to look at unsupervised machine learning. We’ll be looking at clustering and dimensional reduction in particular. We’ll be leaning heavily on scikit-learn. Also in the category of unsupervised machine learning, but not covered here, are anomaly detection and outlier detection.
Before we get cracking on the chapter content, let’s do our normal set of imports.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
random_state = 42 # We'll use this throughout to make this page reproducible
prng = np.random.default_rng(random_state)
Clustering#
To cluster data is to use a set of high-dimensional features to create a grouping wherein data points in the same cluster are more similar to each other than to data points in distinct clusters.
When might you wish to cluster? You can use cluster membership, derived from other data, as an informative fixed effect in a regression. You can use clustering to create features for (supervised) machine learning. You can use clusters to help you understand the different types of (groups of) data in your sample. Some more applied examples are: market segmentation, image segmentation, and anomaly detection.
As with much of the scikit-learn API, there are two ways to use the clustering algorithms in the package: the first is a class that implements the fit method to learn clusters on training data; the second is a function that, given training data, returns an array of integer labels corresponding to the different clusters. As ever, there are fit, fit_predict, and (for many algorithms) predict methods so that you can create clusters, create clusters and allocate data to them, or just allocate new data to existing clusters, respectively.
An example: K-means#
To help illustrate exactly what’s going on in clustering, let’s look at a specific example using the K-means algorithm. K-means creates clusters by trying to separate samples into \(K\) groups of equal variance, minimising a criterion known as the inertia or within-cluster sum-of-squares:

\[
\sum_{i=0}^{N} \min_{\mu_j \in C} \left( ||x_i - \mu_j||^2 \right)
\]

where a set of \(N\) samples is sorted into \(K\) disjoint clusters \(C\), each described by the mean point \(\mu_j\) of the samples in that cluster. The means are commonly called the cluster “centroids”; note that they are not, in general, points from the input data, \(X\).
Let’s see an example of K-means in action. First, let’s generate some data
from sklearn.datasets import make_blobs
batch_size = 45
centers = [[1, 1], [-1, -1], [1, -1]]
n_clusters = len(centers)
X_to_cluster, labels_true = make_blobs(n_samples=3000, centers=centers, cluster_std=0.7, random_state=random_state)
X_to_cluster
array([[ 1.56876808, 1.94936802],
[ 1.47695363, -1.8075846 ],
[ 1.25807132, 0.72466283],
...,
[-0.7031532 , -0.97238707],
[ 0.79388097, 0.01692888],
[ 0.60502668, 1.12918591]])
Now we fit K-means
from sklearn.cluster import KMeans
k_means = KMeans(init="k-means++", n_clusters=3, n_init=10)
k_means.fit(X_to_cluster)
KMeans(n_clusters=3, n_init=10)
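Before plotting, it’s worth a quick look at what the fitted object gives us: the cluster assignment of each training point is stored in labels_, the centroids in cluster_centers_, and new observations can be allocated to existing clusters with predict. A quick sketch (the two extra points below are made up purely for illustration):

print(k_means.labels_[:5])  # cluster assignments of the first five training points
print(k_means.cluster_centers_)  # estimated centroids, one row per cluster
# allocate some new, made-up points to the existing clusters
print(k_means.predict(np.array([[1.0, 1.0], [-1.0, -0.5]])))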
And plot the results:
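A minimal sketch of one way to do this, colouring the points by k_means.labels_ and marking the centroids stored in k_means.cluster_centers_ (the styling choices here are just illustrative):

fig, ax = plt.subplots()
for cluster_no in range(n_clusters):
    in_cluster = k_means.labels_ == cluster_no
    ax.scatter(
        X_to_cluster[in_cluster, 0],
        X_to_cluster[in_cluster, 1],
        s=10,
        alpha=0.5,
        label=f"Cluster {cluster_no}",
    )
# mark the estimated centroids with crosses
ax.scatter(
    k_means.cluster_centers_[:, 0],
    k_means.cluster_centers_[:, 1],
    marker="x",
    s=100,
    color="k",
    label="Centroids",
)
ax.legend()
ax.set_title("K-means clusters of the generated blobs")
plt.show()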
Clustering algorithms in scikit-learn#
There are too many clustering algorithms in scikit-learn alone to cover them in any real detail, so instead we will simply reproduce the ridiculously useful table from the scikit-learn documentation that summarises all of the package’s clustering algorithms and their strengths and weaknesses. (A short sketch of swapping in one of the alternatives follows the table.)
| Method name | Parameters | Scalability | Use case | Geometry (metric used) |
|---|---|---|---|---|
| K-Means | number of clusters | Very large n_samples, medium n_clusters with MiniBatch code | General-purpose, even cluster size, flat geometry, not too many clusters, inductive | Distances between points |
| Affinity propagation | damping, sample preference | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Graph distance (e.g. nearest-neighbor graph) |
| Mean-shift | bandwidth | Not scalable with n_samples | Many clusters, uneven cluster size, non-flat geometry, inductive | Distances between points |
| Spectral clustering | number of clusters | Medium n_samples, small n_clusters | Few clusters, even cluster size, non-flat geometry, transductive | Graph distance (e.g. nearest-neighbor graph) |
| Ward hierarchical clustering | number of clusters or distance threshold | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, transductive | Distances between points |
| Agglomerative clustering | number of clusters or distance threshold, linkage type, distance | Large n_samples and n_clusters | Many clusters, possibly connectivity constraints, non-Euclidean distances, transductive | Any pairwise distance |
| DBSCAN | neighborhood size | Very large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive | Distances between nearest points |
| HDBSCAN | minimum cluster membership, minimum point neighbors | Large n_samples, medium n_clusters | Non-flat geometry, uneven cluster sizes, outlier removal, transductive, hierarchical, variable cluster density | Distances between nearest points |
| OPTICS | minimum cluster membership | Very large n_samples, large n_clusters | Non-flat geometry, uneven cluster sizes, variable cluster density, outlier removal, transductive | Distances between points |
| Gaussian mixtures | many | Not scalable | Flat geometry, good for density estimation, inductive | Mahalanobis distances to centers |
| BIRCH | branching factor, threshold, optional global clusterer | Large n_clusters and n_samples | Large dataset, outlier removal, data reduction, inductive | Euclidean distance between points |
| Bisecting K-Means | number of clusters | Very large n_samples, medium n_clusters | General-purpose, even cluster size, flat geometry, no empty clusters, inductive, hierarchical | Distances between points |
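Because all of these algorithms share the same basic interface, swapping one in is straightforward. As a rough sketch, here is DBSCAN run on the blob data from earlier; note that the eps and min_samples values below are illustrative rather than tuned:

from sklearn.cluster import DBSCAN

# DBSCAN groups points that sit in dense regions and marks sparse points as noise
db = DBSCAN(eps=0.3, min_samples=10).fit(X_to_cluster)
labels_db = db.labels_  # integer cluster labels; -1 indicates points treated as noise
n_clusters_found = len(set(labels_db)) - (1 if -1 in labels_db else 0)
print(f"DBSCAN found {n_clusters_found} clusters")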
Dimensional Reduction#
Dimensional reduction is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data. Reducing dimensions can be helpful because of the curse of dimensionality: data become sparse in higher dimensions, which can make running standard algorithms computationally infeasible or even impossible. It can also be helpful for summarising and visualising high-dimensional information in a continuous way (rather than the discrete way offered by clustering).
Factor analysis is one type of dimensional reduction that may already be familiar to you as an economist. This is available in scikit-learn, along with a range of other reduction algorithms.
In the rest of this chapter, we’ll showcase just two: principal component analysis and UMAP.
Principal Component Analysis#
The absolute bread and butter of dimensional reduction algorithms is principal component analysis, so we’ll start with that as a good example of how to use dimensional reduction algorithms in general with scikit-learn-like APIs. We will project the 4-dimensional Iris data down into two dimensions using PCA.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
target_names_iris = iris.target_names
pca_iris = PCA(n_components=2)
print(f"The dimensionality of the raw data is: {X_iris.shape}")
X_dim_red = pca_iris.fit_transform(X_iris)
print(f"explained variance ratio (1st two components): {(pca_iris.explained_variance_ratio_)}")
X_dim_red.shape
The dimensionality of the raw data is: (150, 4)
explained variance ratio (1st two components): [0.92461872 0.05306648]
(150, 2)
colors = ["navy", "turquoise", "darkorange"]
lw = 2
fig, ax = plt.subplots()
for color, i, target_name in zip(colors, [0, 1, 2], target_names_iris):
ax.scatter(
X_dim_red[y_iris == i, 0], X_dim_red[y_iris == i, 1], color=color, alpha=0.8, lw=lw, label=target_name
)
ax.legend(loc="best", shadow=False, scatterpoints=1)
ax.set_title("PCA of IRIS dataset")
plt.show()
You can see here that we have projected the data onto two principal dimensions (axes) and that there is good separation between the groups of points.
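If you want to see how the original features load onto these two components, the fitted object stores the loadings in its components_ attribute; a quick sketch of viewing them as a table:

# rows are the principal components, columns are the original iris features
pd.DataFrame(
    pca_iris.components_,
    columns=iris.feature_names,
    index=["PC1", "PC2"],
)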
UMAP, or Uniform Manifold Approximation and Projection for Dimension Reduction#
UMAP is a relatively recent algorithm [McInnes, Healy, and Melville, 2018] that is somewhat similar to t-SNE (t-distributed Stochastic Neighbor Embedding). Both have issues with reproducing structure and should be used with care, but both can be quite impressive with the right parameters. t-SNE is included in scikit-learn, but for UMAP, you’ll need to install the umap-learn package.
We will use the penguins dataset to demo UMAP (following the documentation).
penguins = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv")
penguins["sex"] = penguins["sex"].str.title()
penguins.head()
| | species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
|---|---|---|---|---|---|---|---|
| 0 | Adelie | Torgersen | 39.1 | 18.7 | 181.0 | 3750.0 | Male |
| 1 | Adelie | Torgersen | 39.5 | 17.4 | 186.0 | 3800.0 | Female |
| 2 | Adelie | Torgersen | 40.3 | 18.0 | 195.0 | 3250.0 | Female |
| 3 | Adelie | Torgersen | NaN | NaN | NaN | NaN | NaN |
| 4 | Adelie | Torgersen | 36.7 | 19.3 | 193.0 | 3450.0 | Female |
As ever, it does help to clean up the data. We will drop NAs and convert each feature into z-scores (number of standard deviations from the mean) for comparability. We will also only use the numeric columns.
import umap
from sklearn.preprocessing import StandardScaler
reducer = umap.UMAP()
penguins_data = (
penguins
.dropna(how="any")
.loc[:,
[
"bill_length_mm",
"bill_depth_mm",
"flipper_length_mm",
"body_mass_g",
]]
)
scaled_penguin_data = StandardScaler().fit_transform(penguins_data)
Let’s now create the embedding using UMAP:
embedding = reducer.fit_transform(scaled_penguin_data)
embedding.shape
(333, 2)
And plot the results
# map integer category codes to species names (order matches .cat.codes used below)
species_map = dict(enumerate(penguins["species"].astype("category").cat.categories))
fig, ax = plt.subplots()
for i, spec_name in species_map.items():
this_cut = penguins.dropna(how="any")["species"].astype("category").cat.codes == i
ax.scatter(
embedding[this_cut, 0],
embedding[this_cut, 1],
c=plt.rcParams['axes.prop_cycle'].by_key()['color'][i],
label=spec_name,
)
ax.set_title("UMAP projection of the Penguins dataset")
ax.legend()
plt.show()
This does a useful job of capturing the structure of the data, which you can check for yourself by looking at a pairplot of all of the numerical fields.
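For instance, a rough sketch of that check using seaborn (an extra package you may need to install):

import seaborn as sns

# pairwise scatter plots of the numeric penguin measurements, coloured by species
sns.pairplot(
    penguins.dropna(how="any"),
    hue="species",
    vars=["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"],
)
plt.show()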
While the benefits of dimensional reduction are small for such low-dimensional datasets, they become much larger as soon as you are dealing with six or more dimensions, and perhaps as many as thousands.
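As an aside, if you’d like to compare UMAP with t-SNE (which, as noted above, is included in scikit-learn), a minimal sketch on the same scaled penguin data might look like this; the settings are just defaults rather than tuned values:

from sklearn.manifold import TSNE

# t-SNE embedding of the standardised penguin measurements
tsne_embedding = TSNE(n_components=2, random_state=random_state).fit_transform(scaled_penguin_data)
print(tsne_embedding.shape)  # also two-dimensional, like the UMAP embedding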