```{python}
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"
```
Exercise 14: Clustering
The goal of this exercise is to practice with clustering. Objectives:
- Identify application areas for clustering
- Contrast supervised learning with unsupervised learning
- Predict the effect of changes in distance metrics on clustering results
This exercise is not graded.
Getting started
We’ll be working with the Ames home sale data again.
Make a Quarto notebook for this exercise. Here are the imports you’ll need.
Wrangle Data
We’ll be using a dataset of home sales in Ames, Iowa. Each row is a home that was sold. The dataset includes lots of information about each home, such as its size, the year it was built, and its location. It also includes the sale price of each home.
Here is the data dictionary that the author provided, and the academic paper that describes the dataset.
Download the dataset from this link and put it in the data folder. Then read it into a Pandas DataFrame. Look at the .info() and .head() output as usual, although you may not want to include them in your final report because there are so many columns.
data = (
    pd.read_csv("../../static/data/ames/ames_home_sales.csv")
    .query("Gr_Liv_Area < 4000 and Sale_Condition == 'Normal'")
    .rename(columns={"Sale_Price": "sale_price"})
    # Express sale price in thousands of dollars.
    .assign(sale_price=lambda df: df.sale_price / 1000)
    .copy()
)
Split the data
Let’s do a train-test split again as usual. It’s not as important for unsupervised analysis, but if we get an idea about some pattern in the data and want to check whether it’s real, it’ll be helpful to have data we haven’t peeked at. We can get away with a smaller test set, though (so use 4/5 for training).
ames_train, ames_test = train_test_split(data, test_size=0.2, random_state=123)
Clustering
We first need to define what data we want the clustering to use. Why might it not make sense to use all of the data?
- Some kinds of differences are more interesting or important than others!
- The numeric columns are all on different scales.
- We might or might not want to look at an “outcome” column like Sale_Price.
- Many columns are categorical, so we’d need to define what “distance” means for those.
etc. So let’s define the data we want to use for clustering manually.
We’ll start with location (latitude and longitude), but we’ll revisit this decision many times.
data_for_clustering = ames_train[["Latitude", "Longitude"]]
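To make the point about scales concrete, here is a minimal sketch using three made-up homes (the coordinates and areas are invented for illustration, not taken from the Ames data). It computes plain Euclidean distance, which is the kind of distance k-means works with, on unscaled latitude, longitude, and living area.

```python
import numpy as np

# Hypothetical homes: (latitude, longitude, living area in square feet).
# All values are invented for illustration.
home_a = np.array([42.02, -93.64, 1500.0])
home_b = np.array([42.06, -93.60, 1510.0])  # across town, nearly the same size
home_c = np.array([42.02, -93.64, 2500.0])  # same location as A, much larger


def euclidean(u, v):
    """Straight-line (Euclidean) distance between two feature vectors."""
    return float(np.sqrt(np.sum((u - v) ** 2)))


dist_ab = euclidean(home_a, home_b)
dist_ac = euclidean(home_a, home_c)
print(f"A to B: {dist_ab:.2f}")
print(f"A to C: {dist_ac:.2f}")
```

Because living area is measured in much larger units than degrees of latitude or longitude, it accounts for almost all of both distances. The location difference between A and B barely registers.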
Now we cluster the data. We’ll set a random seed so that the random initialization is reproducible (although with 10 random restarts it’s unlikely to make a difference in practice).
We’ll ask for 3 clusters for now. Again, we’ll come back and revisit this later.
# Note: if (AND ONLY IF) this fails with
# AttributeError: 'NoneType' object has no attribute 'split'
# then upgrade threadpoolctl:
# pip install -U threadpoolctl
# see also:
# https://github.com/scikit-learn/scikit-learn/issues/24238
# (the RStudio server is not affected.)
kmeans = (
    KMeans(n_clusters=3, random_state=123, n_init=10)
    .fit(data_for_clustering.values)
)
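If you want to see what a fitted KMeans object holds before looking at the Ames results, the following sketch may help. It uses synthetic, well-separated blobs rather than the Ames data, and the names points, toy_kmeans, and other_seed are placeholders chosen here. It also refits with a different seed to illustrate the remark above that, with 10 restarts, the seed is unlikely to matter on easy data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2-D points in three well-separated blobs (not the Ames data).
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack(
    [center + rng.normal(scale=0.5, size=(50, 2)) for center in blob_centers]
)

toy_kmeans = KMeans(n_clusters=3, random_state=123, n_init=10).fit(points)

print(toy_kmeans.cluster_centers_)  # one row per cluster, one column per feature
print(toy_kmeans.labels_[:5])       # cluster index assigned to each input row
print(toy_kmeans.inertia_)          # sum of squared distances to nearest center

# Refit with a different seed and compare the quality of the solution found.
other_seed = KMeans(n_clusters=3, random_state=7, n_init=10).fit(points)
print(np.isclose(toy_kmeans.inertia_, other_seed.inertia_))
```

The cluster numbers themselves (0, 1, 2) are arbitrary and can differ between seeds even when the grouping is identical, which is why the comparison uses inertia_ rather than labels_.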
Finally, let’s add the cluster assignment back to the original data so we can visualize.
def stringify_cluster_labels(labels):
    return [f"Cluster {x}" for x in labels]

ames_train['cluster'] = stringify_cluster_labels(kmeans.labels_)
Visualize
Let’s visualize the clusters. We’ll use a scatterplot of the data, with the points colored by cluster.
# Plot 1
px.scatter(
    ames_train,
    x="Longitude",
    y="Latitude",
    color="cluster",
    opacity=0.5,
)
# Plot 2
px.scatter(
    ames_train,
    x="Gr_Liv_Area",
    y="Year_Built",
    color="cluster",
    opacity=0.5,
)
Exploring Parameter Settings
Your turn:
- What differences do you notice between the clustering as shown in Plot 1 and the (same) clustering as shown in Plot 2?
- Try increasing n_clusters. What changes about both plots?
- Use only Year_Built for clustering (removing latitude and longitude). What can you say about the age of homes in different parts of town?
- Try clustering using Latitude, Longitude, Gr_Liv_Area. What changes about both plots? Why are they different?
- Try scaling Gr_Liv_Area to have a maximum of 1 by dividing its values by the maximum. (You can use a MinMaxScaler for this if you want.) What changes about both plots? Why?
- Try adding scaling for Latitude (but not Longitude). What changes and why?
- Now add scaling for Longitude. What changes and why?
- Try changing the maximum to 10 for Gr_Liv_Area. Then try 0.1. What changes and why?
- Try adding Year_Built.
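For the scaling steps, note that dividing by the maximum and using a MinMaxScaler are similar but not identical. The sketch below compares them on a few made-up living areas (not Ames values). The variable names are placeholders.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up living areas in square feet, as a single-column 2-D array
# (scikit-learn scalers expect shape (n_samples, n_features)).
area = np.array([[800.0], [1200.0], [2000.0], [4000.0]])

# Option 1: divide by the maximum, so the largest value becomes 1.
divided = area / area.max()

# Option 2: MinMaxScaler also shifts the data, so the smallest value becomes 0.
minmax = MinMaxScaler().fit_transform(area)

# Changing the maximum afterwards is a single multiplication.
max_10 = divided * 10

print(divided.ravel())
print(minmax.ravel())
print(max_10.ravel())
```

Both options put the largest home at 1, but only MinMaxScaler puts the smallest home at 0, so the spread of the scaled values differs slightly. MinMaxScaler also accepts a feature_range argument if you would rather set the target range directly.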
Relating to sale price
Do the patterns captured by these clusters also happen to relate to sale price?
Make a plot of sale price by cluster. What do you notice?
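Alongside the plot, a per-cluster numeric summary can serve as a quick sanity check. This is only a sketch on a toy DataFrame with invented prices. With the real data you would presumably group ames_train by its cluster column in the same way.

```python
import pandas as pd

# Toy stand-in for ames_train: invented prices (thousands of dollars) and labels.
toy = pd.DataFrame({
    "cluster": ["Cluster 0", "Cluster 0", "Cluster 1", "Cluster 1", "Cluster 2"],
    "sale_price": [120.0, 140.0, 200.0, 260.0, 310.0],
})

# Count, median, and range of sale price within each cluster.
summary = toy.groupby("cluster")["sale_price"].agg(["count", "median", "min", "max"])
print(summary)
```

If the medians differ a lot between clusters while the ranges overlap only slightly, that suggests the clustering features carry information about price even though price was not used to form the clusters.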