Exercise 10: Thresholds and Metrics

The goal of this exercise is to practice with metrics of classifier performance.

Getting started

We’ll start today with an interactive activity.

Attacking discrimination with smarter machine learning

We’ll focus today on the one-group example; we’ll return to the part about blue vs orange groups later.

Explore

Spend about 10 minutes playing with the “Threshold Decision” demonstration. Discuss the following questions with your partner:

What are the “scores”? What real-life concept does this capture? (Do you know your score?)
Why might a bank want to use “score” to decide whether to grant a loan? (Why don’t banks grant all loan applications? Why do they ever grant loans?)
What sort of predictions are being made here? What constitutes a “correct” prediction?
Slowly sweep the threshold from 0 to 100. On scrap paper, sketch how Correct, True Positive Rate, and Positive Rate change as the threshold changes.

Implement

Now let’s implement those metrics ourselves to check our understanding.

We’ve made a dataset that approximately mimics the dataset from the article. Download people_onegroup.csv and put it in your data folder.

import pandas as pd
import numpy as np
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"

people = pd.read_csv("data/people_onegroup.csv")
people

	score	repay
0	37	True
1	39	True
2	41	True
3	43	True
4	44	True
...	...	...
193	56	False
194	58	False
195	59	False
196	61	False
197	63	False

198 rows × 2 columns

people.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 198 entries, 0 to 197
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   score   198 non-null    int64
 1   repay   198 non-null    bool 
dtypes: bool(1), int64(1)
memory usage: 1.9 KB

The article includes dotplots of the people. We can approximate that using a histogram.

px.histogram(people, x="score", color="repay", nbins=100)

Part 1: EDA

Part 1a: Counts

How many people repaid? What fraction of them did?

Try doing this in two ways. First, use the usual grouping and counting pattern, but add a column for the fraction of people in each group. (Divide the size column by the total number of people in the dataset.)

	repay	size	fraction
0	False	99	0.5
1	True	99	0.5
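
If the grouping and counting pattern is rusty, here is a minimal sketch of one way to do it. It runs on a small made-up dataframe (`toy` and its values are hypothetical, not the real data), so you still need to apply it to `people` yourself.

```python
import pandas as pd

# Hypothetical toy data standing in for data/people_onegroup.csv.
toy = pd.DataFrame({
    "score": [30, 45, 50, 55, 70, 80],
    "repay": [False, False, True, False, True, True],
})

# Count the rows in each repay group; as_index=False keeps repay as a column.
counts = toy.groupby("repay", as_index=False).size()

# Divide each group's size by the total number of rows.
counts["fraction"] = counts["size"] / len(toy)
print(counts)
```

Note that the column is accessed as `counts["size"]`, because `counts.size` is a built-in dataframe attribute (the total number of cells), not the column.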

Alternative way: think about what this does:

people['repay'].sum()

What if we changed .sum to .mean? (I call this the “sum-as-count pattern”.)
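
To check your answer, here is a small sketch on a made-up boolean Series (the values are hypothetical). It relies on pandas treating True as 1 and False as 0 in arithmetic.

```python
import pandas as pd

# Hypothetical repay values, not taken from the real dataset.
repay = pd.Series([True, False, True, True])

n_repaid = repay.sum()      # adds up 1s and 0s, so this counts the Trues
frac_repaid = repay.mean()  # count of Trues divided by the number of rows
print(n_repaid, frac_repaid)
```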

Part 1b: Mean scores

What was the mean score for people who repaid? What was the mean score for people who didn’t?

	repay	score
0	False	40.010101
1	True	60.000000
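
One possible way to get a table shaped like this is to group and then take the mean of a single column. The sketch below uses a hypothetical `toy` dataframe, so its numbers will not match the ones above.

```python
import pandas as pd

# Hypothetical toy data standing in for the real people dataframe.
toy = pd.DataFrame({
    "score": [30, 45, 50, 55, 70, 80],
    "repay": [False, False, True, False, True, True],
})

# Select the score column after grouping, then average within each group.
mean_scores = toy.groupby("repay", as_index=False)["score"].mean()
print(mean_scores)
```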

Part 2: Thresholds

On the website, pick a threshold that results in all 4 colors being visible; it’s especially visible in the Positive Rate pie.
Assign that threshold to a variable in your qmd.
Add a new column, granted, to the people dataframe that indicates if the bank grants the loan. Like the website, grant a loan if the score exceeds the threshold.
How many loans were granted?

Here’s an example for a threshold of 64.

	score	repay	granted
0	37	True	False
1	39	True	False
2	41	True	False
3	43	True	False
4	44	True	False
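
The thresholding step can be sketched as a single vectorized comparison. The dataframe and the threshold of 50 below are hypothetical; use your own threshold and the real `people` dataframe in your document.

```python
import pandas as pd

# Hypothetical toy data standing in for the real people dataframe.
toy = pd.DataFrame({
    "score": [30, 45, 50, 55, 70, 80],
    "repay": [False, False, True, False, True, True],
})

threshold = 50  # hypothetical value, chosen only for illustration

# Strict comparison: a score equal to the threshold does not get a loan.
toy["granted"] = toy["score"] > threshold

# Sum-as-count: the number of True values is the number of loans granted.
n_granted = toy["granted"].sum()
print(toy)
print("Loans granted:", n_granted)
```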

Part 3: Metrics

We’ll start by making a confusion matrix for the classifier. You’ll find the following helpful (again, I’m showing the result for a threshold of 64):

crosstab = pd.crosstab(people['granted'], people['repay'])
crosstab

repay	False	True
granted
False	99	67
True	0	32

Now pull out which is which:

(a, b), (c, d) = crosstab.values
print(f"a: {a}, b: {b}, c: {c}, d: {d}")

a: 99, b: 67, c: 0, d: 32

In your document, replace a, b, c, and d with names like false_positive and true_negative (or fp and tn).
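
As a hedged sketch of that renaming: rows of the crosstab are `granted` (the prediction) and columns are `repay` (the truth), each ordered False then True. Here the matrix is rebuilt by hand from the counts shown above rather than from the data file, and treating "granted" as the positive prediction is my reading of the webapp's convention.

```python
import pandas as pd

# Confusion matrix for a threshold of 64, typed in from the output above.
crosstab = pd.DataFrame(
    [[99, 67], [0, 32]],
    index=pd.Index([False, True], name="granted"),
    columns=pd.Index([False, True], name="repay"),
)

# Row 0: loan not granted (predicted negative); row 1: loan granted.
# Column 0: did not repay (actual negative); column 1: did repay.
(tn, fn), (fp, tp) = crosstab.values
print(f"tn: {tn}, fn: {fn}, fp: {fp}, tp: {tp}")
```

If a threshold leaves one row or column empty (for example, no loans granted at all), the crosstab will not be 2×2 and this unpacking will fail, which is one reason to pick a threshold where all four colors are visible.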

Compute and show the Positive Rate, the True Positive Rate, and Correctness. Check all of your results against the webapp; the numbers should be close although they may differ slightly because the data was not constructed identically (and because the webapp doesn’t show the full precision of the threshold or the metrics).

Positive Rate: 0.162
True Positive Rate: 0.323
Correctness: 0.662
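
For reference, here is one way those three numbers can be computed from the four confusion-matrix counts. The definitions are my reading of the webapp's labels (Positive Rate as the fraction of people granted a loan, Correctness as the fraction of correct decisions); with the threshold-64 counts they reproduce the values above.

```python
# Counts for a threshold of 64, from the confusion matrix above.
tn, fn, fp, tp = 99, 67, 0, 32
total = tn + fn + fp + tp

# Fraction of all applicants who were granted a loan.
positive_rate = (tp + fp) / total

# Fraction of people who would repay that were granted a loan.
true_positive_rate = tp / (tp + fn)

# Fraction of all decisions that matched what actually happened.
correctness = (tp + tn) / total

print(f"Positive Rate: {positive_rate:.3f}")
print(f"True Positive Rate: {true_positive_rate:.3f}")
print(f"Correctness: {correctness:.3f}")
```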

Compute and show the precision and recall of this classifier at the threshold you’ve selected.
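
A minimal sketch using the standard definitions of precision and recall, applied to the threshold-64 counts rather than your own threshold. The guard against dividing by zero is my addition, for thresholds where no loans are granted.

```python
# Counts for a threshold of 64, from the confusion matrix above.
tn, fn, fp, tp = 99, 67, 0, 32

# Precision: of the loans we granted, what fraction were repaid?
# Undefined when no loans are granted, so fall back to NaN in that case.
precision = tp / (tp + fp) if (tp + fp) > 0 else float("nan")

# Recall: of the people who would repay, what fraction got a loan?
# This is the same quantity as the True Positive Rate.
recall = tp / (tp + fn) if (tp + fn) > 0 else float("nan")

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
```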

Part 4: Trade-offs

Adjust the threshold to maximize the Correct rate. What is the True Positive Rate then?

What trade-off do we have to make if we want to maximize True Positive Rate instead?

Give specific examples of thresholds that achieve these objectives.
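
If you would rather search than guess, one option is to sweep every threshold in code and tabulate the metrics. This is a sketch on a hypothetical toy dataframe (so the best threshold it finds says nothing about the real data); swap in `people` to explore the actual trade-off.

```python
import pandas as pd

# Hypothetical toy data standing in for the real people dataframe.
toy = pd.DataFrame({
    "score": [30, 45, 50, 55, 70, 80],
    "repay": [False, False, True, False, True, True],
})

rows = []
for threshold in range(0, 101):
    granted = toy["score"] > threshold
    # Fraction of decisions that match the actual outcome.
    correct = (granted == toy["repay"]).mean()
    # Fraction of repayers who were granted a loan.
    tpr = (granted & toy["repay"]).sum() / toy["repay"].sum()
    rows.append({"threshold": threshold, "correct": correct, "tpr": tpr})

sweep = pd.DataFrame(rows)

# idxmax returns the first row with the maximum, so ties go to the lowest threshold.
best = sweep.loc[sweep["correct"].idxmax()]
print(best)
```

Looking at the first row of `sweep` (threshold 0) is a quick way to see what maximizing True Positive Rate costs in correctness.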