Regression and Classification Metrics
Objectives
- Compare and contrast regression tasks and classification tasks, and give examples of each
- Identify two different ways of measuring accuracy for regression and for classification
- Identify several reasons why a model may predict better on some subsets of data than others
Types of Tasks
- regression: predict a number (“continuous”)
- number should be “close” in some sense to the correct number
- classification: predict a category
- which one of these two groups? three groups? 500,000 groups?
- could ask: “how likely is it to be in group i”
Ames Housing Example
Regression example: predicting sale prices of homes in Ames, Iowa.
What makes a good prediction? Regression
We predicted the home would sell for $250k. It sold for $200k. Is that good?
- residual (error): actual minus predicted
If the home sold for $200k but we predicted $250k, the residual is -$50k
- absolute error: $50k
- percent error: 25%
Across the entire dataset:
- average error: do we tend to predict too high? too low?
- MAE: Mean Absolute Error (“predictions are usually off by $xxx”)
- MAPE: Mean Absolute Percent Error (“predictions are usually off by yy%”)
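A minimal sketch of these aggregate metrics with NumPy; the sale prices and predictions below are made-up numbers for illustration, not the Ames data:

```python
import numpy as np

# Made-up sale prices and model predictions, in dollars
actual    = np.array([200_000, 310_000, 150_000, 425_000])
predicted = np.array([250_000, 295_000, 160_000, 400_000])

residuals = actual - predicted                    # actual minus predicted

mean_error = residuals.mean()                     # signed: do we tend to predict too high or too low?
mae  = np.abs(residuals).mean()                   # Mean Absolute Error, in dollars
mape = (np.abs(residuals) / actual).mean() * 100  # Mean Absolute Percent Error

print(f"mean error: {mean_error:+,.0f}")
print(f"MAE:  ${mae:,.0f}")
print(f"MAPE: {mape:.1f}%")
```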
Other kinds of errors
- max absolute error
- mean squared error (MSE)
- emphasizes large errors
- de-emphasizes small errors
- often easier to work with mathematically
- units are squared
- normalized squared error: MSE / variance of the actual values
- “R²” = 1 - normalized squared error
- RMSE: Root Mean Squared Error
- square root of MSE
- units are the same as the original data
- intuition: like a standard deviation
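A sketch of the squared-error family of metrics, reusing the same made-up numbers as above:

```python
import numpy as np

# Same made-up sale prices and predictions as in the MAE/MAPE sketch
actual    = np.array([200_000, 310_000, 150_000, 425_000])
predicted = np.array([250_000, 295_000, 160_000, 400_000])
residuals = actual - predicted

max_abs_error = np.abs(residuals).max()  # worst single prediction
mse  = (residuals ** 2).mean()           # Mean Squared Error; units are dollars squared
rmse = np.sqrt(mse)                      # back to dollars; behaves like a standard deviation
r2   = 1 - mse / actual.var()            # R² = 1 - normalized squared error (MSE / variance of actual)
```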
Seizure classification
First FDA-approved AI-powered medical device: the Empatica Embrace2, made by a company co-founded by MIT professor Rosalind Picard
What makes a good prediction? Classification
Suppose: every second, the armband decides whether a seizure is occurring
The child was perfectly fine but our armband flagged a seizure. Is that good?
The child was having a seizure but our armband didn’t flag it. Is that good?
Confusion Matrices for Classifiers
Seizure prediction example:
|                     | Armband flagged a seizure     | Armband did not flag a seizure |
|---------------------|-------------------------------|--------------------------------|
| Seizure happened    | True positive                 | False negative (Type II error) |
| No seizure happened | False positive (Type I error) | True negative                  |
In general:
|                    | Predicted positive  | Predicted negative  |
|--------------------|---------------------|---------------------|
| Actually positive  | True positive (TP)  | False negative (FN) |
| Actually negative  | False positive (FP) | True negative (TN)  |
Accuracy (% correct) = (TP + TN) / (# predictions made)
False negative (“miss”) rate = FN / (# actual positives)
- aka “Type II error” rate
- 1 - False negative rate = “recall” or “sensitivity”
False positive (“false alarm”) rate = FP / (# actual negatives)
- 1 - False positive rate = “specificity”
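A minimal sketch of these rates, tallying the four confusion-matrix cells from hypothetical per-second labels (1 = seizure, 0 = no seizure):

```python
import numpy as np

# Hypothetical labels: one entry per second
actual    = np.array([1, 1, 0, 0, 0, 1, 0, 0])  # what really happened
predicted = np.array([1, 0, 0, 1, 0, 1, 0, 0])  # what the armband flagged

tp = np.sum((predicted == 1) & (actual == 1))   # true positives
fn = np.sum((predicted == 0) & (actual == 1))   # misses (Type II errors)
fp = np.sum((predicted == 1) & (actual == 0))   # false alarms (Type I errors)
tn = np.sum((predicted == 0) & (actual == 0))   # true negatives

accuracy            = (tp + tn) / len(actual)
false_negative_rate = fn / (tp + fn)            # FN / # actual positives
false_positive_rate = fp / (fp + tn)            # FP / # actual negatives
sensitivity         = 1 - false_negative_rate   # aka recall
specificity         = 1 - false_positive_rate
```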
Precision and Recall
|                    | Predicted positive  | Predicted negative  |
|--------------------|---------------------|---------------------|
| Actually positive  | True positive (TP)  | False negative (FN) |
| Actually negative  | False positive (FP) | True negative (TN)  |
- Precision = TP / (# predicted positives)
- can we trust the model when it says “seizure”?
- Recall = TP / (# actual positives)
- aka True Positive Rate
- aka “sensitivity”
- can we trust the model to catch all seizures?
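scikit-learn computes these same ratios directly; the labels below are the same hypothetical ones as in the earlier sketch:

```python
from sklearn.metrics import precision_score, recall_score

actual    = [1, 1, 0, 0, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 0, 0]

precision = precision_score(actual, predicted)  # TP / # predicted positives
recall    = recall_score(actual, predicted)     # TP / # actual positives
print(f"precision: {precision:.2f}  recall: {recall:.2f}")
```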
If you were designing a seizure alert system, would you want precision and recall to be high or low? What are the trade-offs associated with each decision?
Validation
Key point: you must evaluate predictions on unseen data
Failure to generalize
Predictive models almost always do better on the data they were trained on than on anything else.
- the model uses a pattern that held only by chance
- the model uses a pattern that holds only for some of the data
- the model uses a real pattern, but only got a fuzzy picture of it
How can we accurately assess our models?
General strategy: hold out data.
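A minimal sketch of the hold-out strategy with scikit-learn; the data here is synthetic stand-in data, not the Ames or seizure datasets:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic stand-in data; in the Ames example, X would hold home features and y sale prices
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 1.0, 0.5, 0.0]) + rng.normal(scale=0.5, size=500)

# Hold out 20% of the rows; fit only on the remaining 80%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LinearRegression().fit(X_train, y_train)

# The gap between training error and held-out error measures failure to generalize
print("train MAE:", mean_absolute_error(y_train, model.predict(X_train)))
print("test  MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```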