Regression and Classification Metrics

Objectives

  • Compare and contrast regression tasks and classification tasks, and give examples of each
  • Identify two different ways of measuring accuracy for regression and for classification
  • Identify several reasons why a model may predict better on some subsets of data than others

Types of Tasks

  • regression: predict a number (“continuous”)
    • number should be “close” in some sense to the correct number
  • classification: predict a category
    • which one of these two groups? three groups? 500,000 groups?
    • could ask: “how likely is it to be in group i?”

Are these tasks regression or classification?

  1. Is this a picture of the inside or outside of the restaurant?
  2. How much will it rain in GR next year?
  3. Is this person having a seizure?
  4. How much will this home sell for?
  5. How much time will this person spend watching this video?
  6. How big a fruit will this plant produce?
  7. Which word did this person mean to type?
  8. Will this person “Like” this post?

Ames Housing Example

Our running regression example: predicting the sale prices of houses in Ames, Iowa.

What makes a good prediction? Regression

We predicted the home would sell for $250k. It sold for $200k. Is that good?

  • residual (error): actual minus predicted
    If the home sold for $200k but we predicted $250k, the residual is $200k − $250k = −$50k
  • absolute error: |−$50k| = $50k
  • percent error: $50k / $200k = 25%

Across the entire dataset:

  • average error: do we tend to predict too high? too low?
  • MAE: Mean Absolute Error (“predictions are usually off by $xxx”)
  • MAPE: Mean Absolute Percent Error (“predictions are usually off by yy%”)
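
A minimal sketch of these dataset-level metrics in Python using NumPy; the arrays and dollar values below are made up purely for illustration.

    import numpy as np

    # Hypothetical actual and predicted sale prices, in dollars (invented for illustration).
    actual = np.array([200_000, 310_000, 150_000, 425_000])
    predicted = np.array([250_000, 300_000, 140_000, 400_000])

    residuals = actual - predicted                    # negative means we predicted too high
    mean_error = residuals.mean()                     # average error: too high or too low on average?
    mae = np.abs(residuals).mean()                    # MAE: "predictions are usually off by $xxx"
    mape = np.abs(residuals / actual).mean() * 100    # MAPE: "predictions are usually off by yy%"

    print(mean_error, mae, mape)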

Other kinds of errors

  • max absolute error
  • mean squared error (MSE)
    • emphasizes large errors
    • de-emphasizes small errors
    • often easier to work with mathematically
    • units are squared
  • normalized squared error: MSE / variance of the actual values
    • R² = 1 − normalized squared error
  • RMSE: Root Mean Squared Error
    • square root of MSE
    • units are the same as the original data
    • intuition: like a standard deviation
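
A sketch of the squared-error family of metrics, continuing with the same illustrative arrays as above; note that R² here divides MSE by the variance of the actual values, matching the definition in the list.

    import numpy as np

    actual = np.array([200_000, 310_000, 150_000, 425_000])
    predicted = np.array([250_000, 300_000, 140_000, 400_000])
    residuals = actual - predicted

    max_abs_error = np.abs(residuals).max()
    mse = (residuals ** 2).mean()        # units are dollars squared
    rmse = np.sqrt(mse)                  # back in dollars; behaves like a standard deviation
    r2 = 1 - mse / actual.var()          # 1 - normalized squared error

    print(max_abs_error, mse, rmse, r2)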

Seizure classification

First FDA-approved AI-powered medical device: the Empatica Embrace2, made by a company co-founded by MIT data scientist Rosalind Picard

What makes a good prediction? Classification

Suppose: every second, the armband decides whether a seizure is occurring


The child was perfectly fine but our armband flagged a seizure. Is that good?


The child was having a seizure but our armband didn’t flag it. Is that good?

Confusion Matrices for Classifiers

Seizure prediction example:

                      Seizure predicted                No seizure predicted
Seizure happened      True positive                    False negative (Type II error)
No seizure happened   False positive (Type I error)    True negative

In general:

                    Predicted positive     Predicted negative
Actually positive   True positive (TP)     False negative (FN)
Actually negative   False positive (FP)    True negative (TN)

  • Accuracy (% correct) = (TP + TN) / (# predictions made)
  • False negative (“miss”) rate = FN / (# actual positives)
    • a false negative is also called a “Type II error”
    • 1 − false negative rate = “recall” or “sensitivity”
  • False positive (“false alarm”) rate = FP / (# actual negatives)
    • a false positive is also called a “Type I error”
    • 1 − false positive rate = “specificity”
  • Wikipedia article
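
A sketch of these counts and rates in Python, using hypothetical per-second labels (1 = a seizure, 0 = no seizure); the y_true and y_pred arrays are invented for illustration.

    import numpy as np

    # Hypothetical per-second labels: 1 = seizure, 0 = no seizure.
    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # what actually happened
    y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])   # what the armband flagged

    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    accuracy = (tp + tn) / len(y_true)      # % correct
    fnr = fn / (tp + fn)                    # "miss" rate = 1 - recall/sensitivity
    fpr = fp / (fp + tn)                    # "false alarm" rate = 1 - specificity

    print(accuracy, fnr, fpr)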

Precision and Recall

                    Predicted positive     Predicted negative
Actually positive   True positive (TP)     False negative (FN)
Actually negative   False positive (FP)    True negative (TN)

  • Precision = TP / (# predicted positives)
    • can we trust the model when it says “seizure”?
  • Recall = TP / (# actual positives)
    • aka True Positive Rate
    • aka “sensitivity”
    • can we trust the model to catch all seizures?
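
Precision and recall computed from the same kind of hypothetical label arrays as above; scikit-learn's precision_score and recall_score give the same numbers.

    import numpy as np

    y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])   # 1 = seizure actually happened
    y_pred = np.array([0, 1, 1, 0, 0, 1, 0, 0])   # 1 = armband flagged a seizure

    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))

    precision = tp / (tp + fp)   # when the model says "seizure", how often is it right?
    recall = tp / (tp + fn)      # of all real seizures, how many did it flag?

    print(precision, recall)

    # scikit-learn computes the same quantities:
    # from sklearn.metrics import precision_score, recall_score
    # precision_score(y_true, y_pred), recall_score(y_true, y_pred)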

If you were designing a seizure alert system, would you want precision and recall to be high or low? What are the trade-offs associated with each decision?

Validation

Key point: you must evaluate predictions on unseen data

Failure to generalize

Predictive models almost always do better on the data they were trained on than on any other data.

Why?

  • the model relies on a pattern that held only by chance in the training data
  • the model relies on a pattern that holds only for some subsets of the data
  • the model relies on a real pattern, but only got a fuzzy (noisy) picture of it

How can we accurately assess our models?

General strategy: hold out data.
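
A minimal sketch of the hold-out strategy using scikit-learn's train_test_split; the data here is synthetic and the linear model is just a placeholder, not the course's actual Ames workflow.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    # Synthetic stand-in data: X = features, y = sale price (invented for illustration).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = X @ np.array([50_000.0, 20_000.0, 10_000.0]) + 200_000 + rng.normal(scale=25_000, size=200)

    # Hold out 20% of the rows; the model never sees them during training.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    model = LinearRegression().fit(X_train, y_train)

    print("train MAE:", mean_absolute_error(y_train, model.predict(X_train)))
    print("test  MAE:", mean_absolute_error(y_test, model.predict(X_test)))

The gap between the training MAE and the test MAE is one simple measure of how well the model generalizes.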