14.1 Unsupervised Learning

So far we have been doing supervised learning, where we have a target we’re trying to predict.
- “How much will these homes sell for?”
- “How long will this person spend watching this video?”
Unsupervised learning applies when we don’t have a specific target to predict, or when we want to explore relationships in the data.
- “What general types of homes are on the market right now?”
- “What are some different segments of our customer base?”
- “Are there distinct types of Covid-19 symptoms?”
Clustering is one very common type of unsupervised learning.

Clustering

Goal: put observations into groups

Those in the same group should be similar to each other.
Those in different groups should be different.

Crucial questions:

How many groups?
How do we define “similar” / “different”?
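
One way to make the goal concrete is to check a proposed grouping numerically: distances between points in the same group should be small compared with distances between points in different groups. The sketch below is only an illustration. The toy points, the hand-assigned labels, and the helper name `mean_pairwise_distances` are all assumptions, and Euclidean distance is just one possible definition of “similar”.

```python
import math
from itertools import combinations

# Hypothetical toy data: (x, y) points with hand-assigned group labels.
points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels = [0, 0, 0, 1, 1, 1]

def mean_pairwise_distances(points, labels):
    """Return (mean within-group distance, mean between-group distance)."""
    within, between = [], []
    # Visit every unordered pair of points exactly once.
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        d = math.dist(p, q)  # Euclidean distance (an assumed choice of metric)
        if labels[i] == labels[j]:
            within.append(d)
        else:
            between.append(d)
    return sum(within) / len(within), sum(between) / len(between)

within, between = mean_pairwise_distances(points, labels)
print(f"mean within-group distance:  {within:.2f}")
print(f"mean between-group distance: {between:.2f}")
```

A good grouping gives a within-group value well below the between-group value. Scrambling the labels (for example `[0, 1, 0, 1, 0, 1]`) narrows that gap.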

Artwork by @allison_horst

Many types of clustering algorithms

Source: sklearn documentation

Some differences between clustering algorithms

Do we need to specify number of clusters?
Do clusters have to have specific shapes?
Does every observation have to be in a cluster (or can there be “outliers”)?
Does every point have to be in exactly one cluster (or can there be “fuzzy” clusters)? (“hard” vs “soft” clustering)
How fast is it? (does it scale to large datasets?)
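
The “hard” vs “soft” distinction in the list above can be sketched without any particular library. In this illustration, the cluster centers, the points, and the `scale` bandwidth are made-up values, and the Gaussian-style weighting is just one assumed way to produce soft memberships. It is not the method of any specific algorithm.

```python
import math

# Hypothetical 1-D cluster centers and points, chosen only for illustration.
centers = [0.0, 10.0]
points = [1.0, 5.0, 9.0]

def hard_assign(x, centers):
    """Hard clustering: return the index of the single nearest center."""
    return min(range(len(centers)), key=lambda k: abs(x - centers[k]))

def soft_assign(x, centers, scale=2.0):
    """Soft clustering: return membership weights that sum to 1.

    Weights decay with squared distance to each center; `scale` is an
    assumed bandwidth, not a value taken from these notes.
    """
    weights = [math.exp(-((x - c) ** 2) / (2 * scale ** 2)) for c in centers]
    total = sum(weights)
    return [w / total for w in weights]

for x in points:
    memberships = [round(m, 3) for m in soft_assign(x, centers)]
    print(x, "hard:", hard_assign(x, centers), "soft:", memberships)
```

The point halfway between the two centers shows the difference most clearly. Hard assignment must pick one cluster (here the tie goes to the first center), while soft assignment splits the membership evenly.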

Impact of distance metric

What if two items are close in one dimension, but far in another?
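
A small sketch of why this question matters: which point counts as “closest” can change with the distance metric. The three points below are hypothetical, and the three metrics (Euclidean, Manhattan, Chebyshev) are common choices assumed here for illustration. Point `b` matches `a` exactly in one dimension but is far away in the other, while `c` is moderately far in both.

```python
import math

def euclidean(p, q):
    """Straight-line distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Sum of absolute per-dimension differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def chebyshev(p, q):
    """Largest absolute per-dimension difference."""
    return max(abs(a - b) for a, b in zip(p, q))

# Hypothetical points: b is identical to a in y but far in x;
# c is moderately far from a in both dimensions.
a = (0.0, 0.0)
b = (5.0, 0.0)
c = (3.0, 3.0)

for name, metric in [("euclidean", euclidean),
                     ("manhattan", manhattan),
                     ("chebyshev", chebyshev)]:
    d_ab, d_ac = metric(a, b), metric(a, c)
    nearer = "b" if d_ab < d_ac else "c"
    print(f"{name:10s} d(a,b)={d_ab:.2f}  d(a,c)={d_ac:.2f}  nearer to a: {nearer}")
```

Under Manhattan distance `b` is the nearer point, while under Euclidean and Chebyshev distance `c` is. A clustering algorithm could therefore group `a` differently depending on the metric.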