Exercise 14: Clustering

The goal of this exercise is to practice clustering.

This exercise is not graded.

Getting started

We’ll be working with the Ames home sale data again.

Make a Quarto notebook for this exercise. Here are the imports you’ll need.

```{python}
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd

import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"
```

Wrangle Data

We’ll be using a dataset of home sales in Ames, Iowa. Each row is a home that was sold. The dataset includes lots of information about each home, such as its size, the year it was built, and its location. It also includes the sale price of each home.

Here is the data dictionary that the author provided, and the academic paper that describes the dataset.

Download the dataset from this link and put it in the data folder. Then read it into a Pandas DataFrame. Look at the .info() and .head() as usual, although you may not want to include them in your final report because there are so many columns.

```{python}
data = (
    pd.read_csv("../../static/data/ames/ames_home_sales.csv")
    .query("Gr_Liv_Area < 4000 and Sale_Condition == 'Normal'")
    .rename(columns={"Sale_Price": "sale_price"})
    .assign(sale_price=lambda df: df.sale_price / 1000)
    .copy()
)
```
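
A quick, optional sanity check (standard pandas calls; you may want to leave these out of the rendered report):

```{python}
# Peek at the structure of the wrangled data
data.info()
data.head()
```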

Split the data

Let’s do a train-test split again as usual. It’s not as important for unsupervised analysis, but if we get an idea about some pattern in the data and want to check whether it’s real, it’ll be helpful to have data we haven’t peeked at. We can get away with a smaller test set, though (so use 4/5 for training).

```{python}
ames_train, ames_test = train_test_split(data, test_size=0.2, random_state=123)
```

Clustering

We first need to define what data we want the clustering to use. Why might it not make sense to use all of the data?

  1. Some kinds of differences are more interesting or important than others!
  2. The numeric columns are all on different scales.
  3. We might or might not want to look at an “outcome” column like sale_price.
  4. Many columns are categorical, so we’d need to define what “distance” means for those.

And so on. So let’s explicitly define the data we want to use for clustering.
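
To make point 2 above concrete, here’s an optional check (a sketch using columns that appear later in this exercise) showing how different the numeric scales are:

```{python}
# Compare the scales of a few numeric columns
ames_train[["Latitude", "Longitude", "Gr_Liv_Area", "Year_Built"]].describe().round(2)
```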

We’ll start with location (latitude and longitude), but we’ll revisit this decision many times.

```{python}
data_for_clustering = ames_train[["Latitude", "Longitude"]]
```

Now we cluster the data. We’ll set a random seed so that the random initialization is reproducible (although with 10 random restarts it’s unlikely to make a difference in practice).

We’ll ask for 3 clusters for now. Again, we’ll come back and revisit this later.

```{python}
# Note: if (AND ONLY IF) this fails with
#   AttributeError: 'NoneType' object has no attribute 'split'
# then upgrade threadpoolctl:
#   pip install -U threadpoolctl
# see also:
#   https://github.com/scikit-learn/scikit-learn/issues/24238
# (the RStudio server is not affected.)
kmeans = (
    KMeans(n_clusters=3, random_state=123, n_init=10)
    .fit(data_for_clustering.values)
)
```

Finally, let’s add the cluster assignment back to the original data so we can visualize.

```{python}
def stringify_cluster_labels(labels):
    return [f"Cluster {x}" for x in labels]

ames_train["cluster"] = stringify_cluster_labels(kmeans.labels_)
```

Visualize

Let’s visualize the clusters. We’ll use a scatterplot of the data, with the points colored by cluster.

```{python}
# Plot 1: the clusters in the space they were computed in (location)
px.scatter(
    ames_train,
    x="Longitude",
    y="Latitude",
    color="cluster",
    opacity=0.5,
)
```

```{python}
# Plot 2: the same clusters, shown on different axes
px.scatter(
    ames_train,
    x="Gr_Liv_Area",
    y="Year_Built",
    color="cluster",
    opacity=0.5,
)
```

Exploring Parameter Settings

Your turn:

  1. What differences do you notice between the clustering as shown in Plot 1 and the (same) clustering as shown in Plot 2?
  2. Try increasing n_clusters. What changes about both plots?
  3. Use only Year_Built for clustering (removing latitude and longitude). What can you say about the age of homes in different parts of town?
  4. Try clustering using Latitude, Longitude, Gr_Liv_Area. What changes about both plots? Why are they different?
  5. Try scaling Gr_Liv_Area to have a maximum of 1 by dividing its values by the maximum. (You can use a MinMaxScaler for this if you want; see the sketch after this list.) What changes about both plots? Why?
  6. Try adding scaling for Latitude (but not Longitude). What changes and why?
  7. Now add scaling for Longitude. What changes and why?
  8. Try changing the maximum to 10 for Gr_Liv_Area. Then try 0.1. What changes and why?
  9. Try adding Year_Built.
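
For step 5, here is a minimal sketch of the scaling idea, assuming the same variable names as above. Dividing by the maximum and MinMaxScaler are not identical (MinMaxScaler also subtracts the minimum), but both put Gr_Liv_Area on roughly a 0-to-1 scale:

```{python}
# Sketch for step 5: scale Gr_Liv_Area before clustering
data_for_clustering = ames_train[["Latitude", "Longitude", "Gr_Liv_Area"]].copy()

# Option A: divide by the maximum so the largest value becomes 1
data_for_clustering["Gr_Liv_Area"] = (
    data_for_clustering["Gr_Liv_Area"] / data_for_clustering["Gr_Liv_Area"].max()
)

# Option B (alternative): MinMaxScaler rescales to the [0, 1] range
# from sklearn.preprocessing import MinMaxScaler
# data_for_clustering[["Gr_Liv_Area"]] = MinMaxScaler().fit_transform(
#     data_for_clustering[["Gr_Liv_Area"]]
# )

# Re-fit the clustering, update the labels, then re-run the plots above
kmeans = KMeans(n_clusters=3, random_state=123, n_init=10).fit(data_for_clustering.values)
ames_train["cluster"] = stringify_cluster_labels(kmeans.labels_)
```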

Relating to sale price

Do the patterns captured by these clusters also happen to relate to sale price?

Make a plot of sale price by cluster. What do you notice?
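
One reasonable way to do this (a sketch; other plot types would work too) is a box plot of sale price by cluster:

```{python}
# Distribution of sale price within each cluster
px.box(ames_train, x="cluster", y="sale_price")
```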