Regression and Classification Metrics
Objectives
- Compare and contrast regression tasks and classification tasks, and give examples of each
- Identify two different ways of measuring accuracy for regression and for classification
- Identify several reasons why a model may predict better on some subsets of data than others
Types of Tasks
- regression: predict a number (“continuous”)
- number should be “close” in some sense to the correct number
- classification: predict a category
- which one of these two groups? three groups? 500,000 groups?
- could ask: “how likely is it to be in group i”
Ames Housing Example
Regression example: predicting sale prices of homes in Ames, Iowa.
What makes a good prediction? Regression
We predicted the home would sell for $250k. It sold for $200k. Is that good?
- residual (error): actual minus predicted
If the home sold for $200k but we predicted $250k, the residual is -$50k
- absolute error: $50k
- percent error: 25%
Across the entire dataset:
- average error: do we tend to predict too high? too low?
- MAE: Mean Absolute Error (“predictions are usually off by $xxx”)
- MAPE: Mean Absolute Percent Error (“predictions are usually off by yy%”)
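A minimal sketch of these aggregate metrics with NumPy; the sale prices and predictions below are made-up numbers for illustration, not the Ames data:

```python
import numpy as np

# Made-up sale prices and model predictions, in dollars
actual    = np.array([200_000, 310_000, 150_000, 425_000])
predicted = np.array([250_000, 295_000, 160_000, 400_000])

residuals = actual - predicted                    # actual minus predicted

mean_error = residuals.mean()                     # signed: do we tend to predict too high or too low?
mae  = np.abs(residuals).mean()                   # Mean Absolute Error, in dollars
mape = (np.abs(residuals) / actual).mean() * 100  # Mean Absolute Percent Error

print(f"mean error: {mean_error:+,.0f}")
print(f"MAE:  ${mae:,.0f}")
print(f"MAPE: {mape:.1f}%")
```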
Other kinds of errors
- max absolute error
- mean squared error (MSE)
- emphasizes large errors
- de-emphasizes small errors
- often easier to work with mathematically
- units are squared
- normalized squared error: MSE / variance of the actual values
- “R²” = 1 - normalized squared error
- RMSE: Root Mean Squared Error
- square root of MSE
- units are the same as the original data
- intuition: like a standard deviation
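A sketch of the squared-error family of metrics, reusing the same made-up numbers as above:

```python
import numpy as np

# Same made-up sale prices and predictions as in the MAE/MAPE sketch
actual    = np.array([200_000, 310_000, 150_000, 425_000])
predicted = np.array([250_000, 295_000, 160_000, 400_000])
residuals = actual - predicted

max_abs_error = np.abs(residuals).max()  # worst single prediction
mse  = (residuals ** 2).mean()           # Mean Squared Error; units are dollars squared
rmse = np.sqrt(mse)                      # back to dollars; behaves like a standard deviation
r2   = 1 - mse / actual.var()            # R² = 1 - normalized squared error (MSE / variance of actual)
```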
Seizure classification
First FDA-approved AI-powered medical device: the Empatica Embrace2, made by a company co-founded by MIT professor Rosalind Picard
What makes a good prediction? Classification
Suppose: every second, the armband decides whether a seizure is occurring
The child was perfectly fine but our armband flagged a seizure. Is that good?
The child was having a seizure but our armband didn’t flag it. Is that good?
Confusion Matrices for Classifiers
Seizure prediction example:
|                     | Armband flagged a seizure     | Armband did not flag a seizure |
|---------------------|-------------------------------|--------------------------------|
| Seizure happened    | True positive                 | False negative (Type II error) |
| No seizure happened | False positive (Type I error) | True negative                  |
In general:
|                    | Predicted positive  | Predicted negative  |
|--------------------|---------------------|---------------------|
| Actually positive  | True positive (TP)  | False negative (FN) |
| Actually negative  | False positive (FP) | True negative (TN)  |
Accuracy (% correct) = (TP + TN) / (# predictions made)
False negative (“miss”) rate = FN / (# actual positives)
- aka “Type II error” rate
- 1 - False negative rate = “recall” or “sensitivity”
False positive (“false alarm”) rate = FP / (# actual negatives)
- 1 - False positive rate = “specificity”
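A minimal sketch of these rates, tallying the four confusion-matrix cells from hypothetical per-second labels (1 = seizure, 0 = no seizure):

```python
import numpy as np

# Hypothetical labels: one entry per second
actual    = np.array([1, 1, 0, 0, 0, 1, 0, 0])  # what really happened
predicted = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # what the armband flagged

tp = np.sum((predicted == 1) & (actual == 1))   # true positives
fn = np.sum((predicted == 0) & (actual == 1))   # misses (Type II errors)
fp = np.sum((predicted == 1) & (actual == 0))   # false alarms (Type I errors)
tn = np.sum((predicted == 0) & (actual == 0))   # true negatives

accuracy            = (tp + tn) / len(actual)
false_negative_rate = fn / (tp + fn)            # FN / # actual positives
false_positive_rate = fp / (fp + tn)            # FP / # actual negatives
sensitivity         = 1 - false_negative_rate   # aka recall
specificity         = 1 - false_positive_rate
```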
Precision and Recall
|                    | Predicted positive  | Predicted negative  |
|--------------------|---------------------|---------------------|
| Actually positive  | True positive (TP)  | False negative (FN) |
| Actually negative  | False positive (FP) | True negative (TN)  |
- Precision = TP / (# predicted positives)
- can we trust the model when it says “seizure”?
- Recall = TP / (# actual positives)
- aka True Positive Rate
- aka “sensitivity”
- can we trust the model to catch all seizures?
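scikit-learn computes these same ratios directly; the labels below are the same hypothetical ones as in the earlier sketch:

```python
from sklearn.metrics import precision_score, recall_score

actual    = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]

precision = precision_score(actual, predicted)  # TP / # predicted positives
recall    = recall_score(actual, predicted)     # TP / # actual positives
print(f"precision: {precision:.2f}  recall: {recall:.2f}")
```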
If you were designing a seizure alert system, would you want precision and recall to be high or low? What are the trade-offs associated with each decision?
Validation
Key point: you must evaluate predictions on unseen data
Failure to generalize
Predictive models almost always do better on the data they were trained on than on anything else.
- the model uses a pattern that held only by chance
- the model uses a pattern that holds only for some of the data
- the model uses a real pattern, but only got a fuzzy picture of it
How can we accurately assess our models?
General strategy: hold out data.
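A minimal sketch of the hold-out strategy with scikit-learn; the data here is synthetic stand-in data, not the Ames or seizure datasets:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data; in the Ames example, X would hold home features and y sale prices
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows; fit only on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# The gap between training error and held-out error measures failure to generalize
print("train MAE:", mean_absolute_error(y_train, model.predict(X_train)))
print("test  MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```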