14.1 Unsupervised Learning

  • So far we have been doing supervised learning, where we have a target we’re trying to predict.
    • “How much will these homes sell for?”
    • “How long will this person spend watching this video?”
  • Unsupervised learning applies when we don’t have an exact target to predict, or when we want to explore relationships in the data.
  • Clustering is one very common type of unsupervised learning.

Clustering

Goal: put observations into groups

  • Observations in the same group should be similar to each other.
  • Observations in different groups should be different from each other.

Crucial questions:

  • How many groups?
  • How do we define “similar” / “different”?
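
One way to make “similar” and “different” concrete is to pick a distance and compare the average distance within groups to the average distance between groups. The sketch below assumes Euclidean distance and uses six made-up 2D points; the point names and the helper `mean_pair_distances` are illustrative, not part of any library.

```python
from itertools import combinations
from math import dist  # Euclidean distance (Python 3.8+)

# Toy data, invented for illustration: two visually obvious groups.
points = {
    "a": (1.0, 1.0), "b": (1.5, 1.2), "c": (1.2, 0.8),
    "d": (8.0, 8.0), "e": (8.3, 7.7), "f": (7.8, 8.4),
}

def mean_pair_distances(groups):
    """Return (mean within-group distance, mean between-group distance)."""
    label = {name: g for g, names in enumerate(groups) for name in names}
    within, between = [], []
    for p, q in combinations(points, 2):
        d = dist(points[p], points[q])
        (within if label[p] == label[q] else between).append(d)
    return sum(within) / len(within), sum(between) / len(between)

# A grouping that matches the two blobs, and one that mixes them up.
good = mean_pair_distances([{"a", "b", "c"}, {"d", "e", "f"}])
bad = mean_pair_distances([{"a", "d", "e"}, {"b", "c", "f"}])

print("good grouping: within=%.2f between=%.2f" % good)
print("bad grouping:  within=%.2f between=%.2f" % bad)
```

A good grouping has a small within-group distance relative to its between-group distance. Swapping in a different distance function can change which grouping looks best.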

Artwork by @allison_horst

Many types of clustering algorithms

Source: sklearn documentation

Some differences between clustering algorithms

  • Do we need to specify the number of clusters in advance?
  • Do clusters have to have specific shapes?
  • Does every observation have to be in a cluster (or can there be “outliers”)?
  • Does every point have to be in exactly one cluster (“hard” clustering), or can a point belong partly to several clusters (“soft” or “fuzzy” clustering)?
  • How fast is it? (Does it scale to large datasets?)
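
A minimal sketch of three of these differences using scikit-learn, on toy data invented for illustration (two tight blobs plus one far-away point). The parameter values (`eps=1.0`, `min_samples=3`, `random_state=0`) are assumptions chosen for this toy data, not recommendations.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

# Two tight blobs and one isolated point (the last row).
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [-0.1, 0.2], [0.3, -0.1],
    [5.0, 5.0], [5.2, 5.1], [5.1, 4.8], [4.9, 5.2], [4.8, 4.9],
    [10.0, -5.0],
])

# k-means: we must choose the number of clusters, and every point
# (including the isolated one) is assigned to exactly one cluster.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN: no cluster count is given; points without enough close
# neighbours are labelled -1 ("noise").
dbscan_labels = DBSCAN(eps=1.0, min_samples=3).fit_predict(X)

# Gaussian mixture: "soft" clustering, each row holds membership
# probabilities that sum to 1 across the components.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
soft_memberships = gmm.predict_proba(X)

print("k-means labels:", kmeans_labels)
print("DBSCAN labels: ", dbscan_labels)
print("GMM memberships for first point:", soft_memberships[0].round(3))
```

Which cluster receives label 0 or 1 is arbitrary and can differ between runs or algorithms; only the grouping itself is meaningful.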

Impact of distance metric

  • What if two items are close in one dimension, but far in another?
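
To illustrate, here is a small sketch with made-up homes described by (bedrooms, price in dollars). It assumes Euclidean distance and compares the nearest neighbour of home `A` before and after standardizing each column; the data and the helpers `nearest` and `standardize` are hypothetical.

```python
from math import dist
from statistics import mean, pstdev

# (bedrooms, price in dollars); values invented for illustration.
homes = {
    "A": (2, 300_000),
    "B": (6, 301_000),  # close to A in price, far in bedrooms
    "C": (2, 305_000),  # same bedrooms as A, slightly further in price
    "D": (4, 450_000),
    "E": (3, 200_000),
}

def nearest(name, data):
    """Return the key of the item closest to `name` (Euclidean distance)."""
    others = (other for other in data if other != name)
    return min(others, key=lambda other: dist(data[name], data[other]))

def standardize(data):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*data.values()))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return {
        key: tuple((v - m) / s for v, m, s in zip(row, means, sds))
        for key, row in data.items()
    }

# On raw values, price differences (thousands of dollars) swamp
# bedroom differences (single digits), so only price matters.
print("nearest to A, raw units:   ", nearest("A", homes))
# After standardizing, both dimensions contribute on a comparable scale.
print("nearest to A, standardized:", nearest("A", standardize(homes)))
```

The choice of scaling is itself part of defining “similar”: neither answer is wrong, but they encode different notions of closeness.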