Introduction to Supervised Learning

Day 1: Introduction

Think of examples

How is data used to make decisions that affect you or your neighbors?

Use whatever resources you like.

Predictive Analytics

  • A powerful tool to turn data into action.
  • It works because God made the universe predictable (and successful prediction rewarding), and promised order:
    • covenant with Noah that the seasons would continue (Gen 8:22)
    • assurance in Psalms that the sun and moon keep their appointed times (Ps 104:19)
    • promise to Jeremiah that the sun, moon, and stars would continue (Jer 31:35-36)
  • Need for wisdom: It can be used for great good and great harm

Power of Predictive Modeling

  • Medicine: wearable monitor for seizures or falls, detect malaria from blood smears, find effective drug regimens from medical records
  • Drug Discovery: predict the efficacy of a synthesis plan for a drug
  • Precision Agriculture: predict effect of micro-climate on plant growth
  • Urban Planning: forecast resource needs, extreme weather risks, …
  • Government: classify feedback from constituents
  • Retail: predict items in a grocery order
  • Recommendation systems: Amazon, Netflix, YouTube, …
  • User interfaces: gesture typing, autocomplete / autocorrect

and so much more…

The universe is surprisingly predictable

  • God created the world with actionable structure
    • We gradually learn how to perceive that structure and act within it.
    • The better our perceptions align with how the universe is structured, the better our actions.
    • We can discover that structure by learning to be less surprised by what we see (i.e., by predicting our perceptions).
  • Perceptions are thus both accurate and fallible.

Predictive modeling technology: Need for wisdom

  • Potential for great good, but also great harm:
    • Lack of fairness in facial recognition, criminal risk scoring, lending, job applicant scoring, …
    • Lack of transparency in how “Big Data” systems make conclusions
    • Lack of privacy as data is increasingly collected and aggregated
    • Amplification of extreme positions in social media, YouTube, etc.
    • Oversimplification of human experience
    • Hidden human labor
    • Illusion of objectivity
    • …!

See, for example, Cathy O’Neil’s Weapons of Math Destruction.

Reminders from Statistics

  • We might wish we had all possible data…
  • We must acknowledge our uncertainty.
    • Our measurements are partial.
    • Our inferences sometimes fail (and we may not know it!)
  • But God made a world with structure that we can learn about even with imperfect tools.

Activity

Teachable Machine

Pick a partner (or two). Decide who’s trainer and who’s tester.

Trainer (needs a laptop):

  • Use Teachable Machine to train an image classifier that distinguishes a few objects or poses of your choice.

Tester:

  • Try out your partner’s classifier (without touching the computer or the same objects).
    • Try to find some situations where it works
    • Try to write a succinct description of what situations cause it to behave unexpectedly.

We’ll share descriptions.

Supervised Learning and the scikit-learn API

Scikit-Learn (sklearn)

  • A Python library for machine learning
  • Provides a consistent API for many different kinds of models
  • Provides many tools for data preparation, model evaluation, etc.

Model API: fit, predict

flowchart LR
    A1[("Training Data (X and y)")] --> B{{fit}}
    A2[Model Object] --> B
    B --> FM[Fitted Model]
    FM --> C{{Predict}}
    B2[(New data X)] --> C
    C --> D[("predicted y's")]
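
In code, every estimator follows that same pattern. A minimal sketch with toy data (any classifier could stand in for KNeighborsClassifier here):

from sklearn.neighbors import KNeighborsClassifier

# toy data: two numeric features per row, one label per row
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y = ['a', 'b', 'b', 'a']

model = KNeighborsClassifier(n_neighbors=1)  # model object
fitted = model.fit(X, y)                     # fit: learn from training data
fitted.predict([[0.9, 0.9]])                 # predict: label new data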

Example: penguins

import pandas as pd
penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv")
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
penguins_simple = penguins[['bill_depth_mm', 'bill_length_mm', 'species']].dropna().copy()

Which species is each penguin?

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1).fit(
    X=penguins_simple[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_simple['species'])

penguins_simple['predicted_species'] = model.predict(
    X=penguins_simple[['bill_depth_mm', 'bill_length_mm']])
penguins_simple.head()
   bill_depth_mm  bill_length_mm species predicted_species
0           18.7            39.1  Adelie            Adelie
1           17.4            39.5  Adelie            Adelie
2           18.0            40.3  Adelie            Adelie
4           19.3            36.7  Adelie            Adelie
5           20.6            39.3  Adelie            Adelie

Are those predictions good?

penguins_simple.groupby(['species', 'predicted_species'], as_index=False).size()
     species predicted_species  size
0     Adelie            Adelie   151
1  Chinstrap         Chinstrap    68
2     Gentoo            Gentoo   123
from sklearn.metrics import accuracy_score
accuracy_score(
    y_true=penguins_simple['species'],
    y_pred=penguins_simple['predicted_species'])
1.0

Well, of course…

Why should we be unsurprised that this works?

Train-test split!

flowchart LR
    A1[("All training data")] --> B{{Split}}
    B --> A2[("Training data")]
    B --> A3[("Test data")]
    A2 --> C{{Fit}}
    A3 -->|NEVER!| C
    C --> D[("Fitted model")]
import numpy as np
np.random.seed(0)
penguins_simple["use_for_training"] = np.random.rand(len(penguins_simple)) < 0.5
penguins_train = penguins_simple.query("use_for_training").copy()
penguins_test = penguins_simple.query("not use_for_training").copy()    
print(f"{len(penguins_train)} training penguins, {len(penguins_test)} test penguins")
171 training penguins, 171 test penguins

Note: Later we’ll learn about the train_test_split function, which does this for us.
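
For a preview, the equivalent split with that function looks roughly like this (a sketch; here we shuffle and hold out half the rows):

from sklearn.model_selection import train_test_split
penguins_train, penguins_test = train_test_split(
    penguins_simple, test_size=0.5, random_state=0)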

Train on the training set

model = KNeighborsClassifier(n_neighbors=1).fit(
    X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_train['species'])

How does it do?

penguins_train['predicted_species'] = model.predict(
    X=penguins_train[['bill_depth_mm', 'bill_length_mm']])
penguins_train.groupby(['species', 'predicted_species'], as_index=False).size()
     species predicted_species  size
0     Adelie            Adelie    69
1  Chinstrap         Chinstrap    39
2     Gentoo            Gentoo    63
penguins_test['predicted_species'] = model.predict(
    X=penguins_test[['bill_depth_mm', 'bill_length_mm']])
penguins_test.groupby(['species', 'predicted_species'], as_index=False).size()
     species predicted_species  size
0     Adelie            Adelie    81
1     Adelie         Chinstrap     1
2  Chinstrap            Adelie     2
3  Chinstrap         Chinstrap    26
4  Chinstrap            Gentoo     1
5     Gentoo         Chinstrap     6
6     Gentoo            Gentoo    54
accuracy_score(
    y_true=penguins_test['species'],
    y_pred=penguins_test['predicted_species'])
0.9415204678362573

Other kinds of models

  • Nearest Neighbors (KNeighborsClassifier)
  • Decision Trees (DecisionTreeClassifier) and Random Forests (RandomForestClassifier)
  • Linear Models (LogisticRegression, yes it’s a classifier)
  • Neural Networks (MLPClassifier)
  • Boosting methods (HistGradientBoostingClassifier)

Many models, same API

Decision tree:

from sklearn.tree import DecisionTreeClassifier, plot_tree
model = DecisionTreeClassifier(max_depth=3).fit(
        X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
        y=penguins_train['species'])
model.score(
    X=penguins_test[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_test['species'])
0.9122807017543859

What does that tree do?

Figure 1: Decision tree algorithm
Figure 2: Decision tree, in data space

Linear Model (Logistic Regression)

from sklearn.linear_model import LogisticRegression
model = LogisticRegression().fit(
    X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_train['species'])
model.coef_
array([[ 1.91703649, -0.89183601],
       [ 0.39297577,  0.28951405],
       [-2.31001227,  0.60232196]])
model.score(
    X=penguins_test[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_test['species'])
0.9707602339181286

Neural Network

from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(10, 10)).fit(
    X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_train['species'])
/usr/local/Caskroom/miniconda/base/envs/fa22/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:702: ConvergenceWarning: Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
model.score(
    X=penguins_test[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_test['species'])
0.38596491228070173
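
That score is poor largely because neural networks are sensitive to feature scale, and we fed in raw measurements. A sketch of one fix, standardizing the features inside a pipeline (the exact score you get may vary):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scale each feature to zero mean and unit variance before the MLP sees it
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10, 10), max_iter=1000, random_state=0))
model.fit(
    X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_train['species'])
model.score(
    X=penguins_test[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_test['species'])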

Wednesday Slides

Recall

  1. What does fit do?
  2. What does predict do?
  3. Why did we get an accuracy_score of 1.0? What did we do wrong?
  4. What should you never do with the test set?

Types of Learning from Data

  • Supervised Learning: Filling in a label for each item based on features
    • Classification: Labels are discrete (e.g. penguin species, spam/not spam, which word was swiped, …)
    • Regression: Labels are continuous (e.g. sale price, seconds of video watched, …)
      • Forecasting: predict how a sequence will continue (future observations)
  • Unsupervised Learning: Finding relationships between items, doesn’t need labels
    • Clustering: Grouping similar items (e.g. customer segments, …; see the sketch below)
    • Dimensionality Reduction: Finding a lower-dimensional representation of data (e.g. word embeddings, …)

We’ll focus on supervised learning and clustering in this course. Others:

  • Reinforcement Learning: Learning to take actions to maximize a reward (e.g. game playing, robotics, …)
  • Self-supervised learning: turning an unsupervised problem into a supervised one (e.g. predicting the next word in a sentence)
  • Statistical inference: fill in summary statistics (we wanted a population but only got a sample)
  • Causal inference: fill in counterfactuals (what if?)
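
As a taste of the unsupervised side, here is a sketch of clustering the penguin bill measurements with k-means (assuming penguins_simple from the earlier slides; note that no labels are given to the algorithm):

from sklearn.cluster import KMeans

features = penguins_simple[['bill_depth_mm', 'bill_length_mm']]
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)  # no y!
pd.crosstab(penguins_simple['species'], clusters)  # do the discovered groups match species?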

Regression Example 1

Concrete Mixture Strength

From Applied Predictive Modeling (M. Kuhn and Johnson 2013), chapter 10.

from sklearn.datasets import fetch_openml
concrete = fetch_openml(data_id=4353)
concrete_all = concrete['data'].copy()
print("Original column names:")
print(list(concrete_all.columns))
Original column names:
['Cement (component 1)(kg in a m^3 mixture)', 'Blast Furnace Slag (component 2)(kg in a m^3 mixture)', 'Fly Ash (component 3)(kg in a m^3 mixture)', 'Water  (component 4)(kg in a m^3 mixture)', 'Superplasticizer (component 5)(kg in a m^3 mixture)', 'Coarse Aggregate  (component 6)(kg in a m^3 mixture)', 'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)', 'Concrete compressive strength(MPa. megapascals)']
# clean up column names: remove everything after the first parenthesis, replace spaces with underscores
concrete_all.columns = [
    col.split('(', 1)[0].strip().replace(' ', '_').lower()
    for col in concrete_all.columns]
concrete_all = concrete_all.rename(columns={
    'concrete_compressive_strength': 'strength',
    'blast_furnace_slag': 'slag',
})
concrete_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   cement            1030 non-null   float64
 1   slag              1030 non-null   float64
 2   fly_ash           1030 non-null   float64
 3   water             1030 non-null   float64
 4   superplasticizer  1030 non-null   float64
 5   coarse_aggregate  1030 non-null   float64
 6   fine_aggregate    1030 non-null   float64
 7   age               1030 non-null   float64
 8   strength          1030 non-null   float64
dtypes: float64(9)
memory usage: 72.5 KB

Concrete Mixture Strength data

concrete_all.head()
   cement   slag  fly_ash  water  superplasticizer  coarse_aggregate  fine_aggregate    age  strength
0   540.0    0.0      0.0  162.0               2.5            1040.0           676.0   28.0     79.99
1   540.0    0.0      0.0  162.0               2.5            1055.0           676.0   28.0     61.89
2   332.5  142.5      0.0  228.0               0.0             932.0           594.0  270.0     40.27
3   332.5  142.5      0.0  228.0               0.0             932.0           594.0  365.0     41.05
4   198.6  132.4      0.0  192.0               0.0             978.4           825.5  360.0     44.30

Modeling Workflow

Step 1: Define the problem

  • What are you trying to predict?
    • terms: target, response, label
  • What data do you have?
    • terms: features, predictors, covariates
  • What metrics will indicate success? (Measure success in multiple ways!)
    • terms: loss, error, score, metric
    • penguins example: accuracy (perhaps broken down by species; see the sketch below)
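
For instance, a sketch of per-species accuracy on the penguin test set (assuming penguins_test and its predicted_species column from earlier):

from sklearn.metrics import accuracy_score
penguins_test.groupby('species').apply(
    lambda g: accuracy_score(g['species'], g['predicted_species']))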

Concrete example:

  • Task: predict concrete strength (MPa) from mixture ingredients
  • Target: concrete strength
  • Features: mixture ingredients (cement, slag, ash, water, …)
  • Metric: how close are the predictions to the actual strength?

Step 2: Explore the data

  • What is the structure of the data?
  • What are the distributions of the features and target?
  • Anything unexpected? (e.g. missing values, outliers, …): make lots of plots!

Concrete example:

import plotly.express as px
px.histogram(concrete_all, x='strength', nbins=30)
px.histogram(concrete_all, x='age', nbins=30)
# Do we have duplicates of any mixtures that differ only in age and strength?
other_columns = [col for col in concrete_all.columns if col not in ['age', 'strength']]
(concrete_all
 .groupby(other_columns, as_index=False).size()
 .sort_values(by='size', ascending=False)
 .head(10))
     cement   slag  fly_ash  water  superplasticizer  coarse_aggregate  fine_aggregate  size
356   362.6  189.0      0.0  164.9              11.6             944.7           755.8    20
393   425.0  106.3      0.0  153.5              16.5             852.1           887.1    15
398   446.0   24.0     79.0  162.0              11.6             967.0           712.0    12
413   500.0    0.0      0.0  200.0               0.0            1125.0           613.0     8
355   359.0   19.0    141.0  154.0              10.9             942.0           801.0     8
426   540.0    0.0      0.0  173.0               0.0            1125.0           613.0     7
421   525.0    0.0      0.0  189.0               0.0            1125.0           613.0     7
342   339.0    0.0      0.0  197.0               0.0             968.0           781.0     7
367   380.0   95.0      0.0  228.0               0.0             932.0           594.0     6
147   198.6  132.4      0.0  192.0               0.0             978.4           825.5     6

We’ll need to address this later.

Are they all the same total weight?

concrete_all[other_columns].sum(axis=1).describe()
count    1030.000000
mean     2343.523398
std        65.365356
min      2194.600000
25%      2291.150000
50%      2349.100000
75%      2390.400000
max      2551.000000
dtype: float64

Step 3: Split the data

So we can evaluate the model.

Splitting concrete data

from sklearn.model_selection import train_test_split

concrete_train, concrete_test = train_test_split(concrete_all, random_state=0, test_size=0.2)
print(f"{len(concrete_train)} training mixtures, {len(concrete_test)} test mixtures")
824 training mixtures, 206 test mixtures

Careful: we just duplicated some mixtures in both sets (measurements at different ages).
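
We’ll live with that for now. One remedy is a grouped split that keeps every measurement of a mixture on the same side; a sketch using GroupShuffleSplit (concrete_train_grouped and concrete_test_grouped are hypothetical names, and the rest of these notes keep using the simple split):

from sklearn.model_selection import GroupShuffleSplit

# rows that share the same ingredient amounts get the same group id
ingredient_columns = [col for col in concrete_all.columns
                      if col not in ['age', 'strength']]
mixture_id = concrete_all.groupby(ingredient_columns).ngroup()

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(concrete_all, groups=mixture_id))
concrete_train_grouped = concrete_all.iloc[train_idx]
concrete_test_grouped = concrete_all.iloc[test_idx]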

Step 4: Feature Engineering

Sometimes the features are already in a useful form, but sometimes we can get more useful features by transforming or combining the existing features.

Examples:

  • ratios of ingredients (e.g., coarse aggregate / fine aggregate)
  • log of age (influence of a day decreases as age increases)
  • polynomial features (e.g. \(x^2\))
  • product of two features (interactions)
  • date features (e.g., day of week, month)
  • replace rare categories with “other”

We’ll skip this for concrete for now.
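
For illustration, though, here is a sketch of what two of those might look like on this data (log_age and water_cement_ratio are hypothetical column names):

import numpy as np

concrete_fe = concrete_train.copy()
concrete_fe['log_age'] = np.log(concrete_fe['age'])  # log of age
concrete_fe['water_cement_ratio'] = concrete_fe['water'] / concrete_fe['cement']  # ingredient ratio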

Step 5: Train (fit) and evaluate some models

  • Pick a model (or several) that’s appropriate for the task and data
  • Fit the model to the training data
  • Evaluate the model on the test data

Some models may require special preparation of the data:

  • Missing values
  • Categorical variables
  • Scaling
  • Feature engineering

Step 5: a linear model for concrete

from sklearn.linear_model import LinearRegression
feature_columns = [col for col in concrete_train.columns if col != 'strength']
model = LinearRegression().fit(
    X=concrete_train[feature_columns],
    y=concrete_train['strength'])

concrete_test['predicted_strength'] = model.predict(
    X=concrete_test[feature_columns])

Metrics: Were those predictions good?

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score

def evaluate(y_true, y_pred):
    return pd.Series({
        'MAE': mean_absolute_error(y_true, y_pred),
        'MAPE': mean_absolute_percentage_error(y_true, y_pred),
        'R^2': r2_score(y_true, y_pred),
    })

def evaluate_df(df):
    return evaluate(df['strength'], df['predicted_strength'])

evaluate_df(concrete_test).to_frame('test')
          test
MAE   7.864642
MAPE  0.331604
R^2   0.636961
  • MAE: Mean Absolute Error (“predictions are usually off by xxx MPa”)
  • MAPE: Mean Absolute Percent Error (“predictions are usually off by yy%”)
  • R^2: the traditional “fraction of variance explained” (formulas below)
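
For \(n\) test rows with true values \(y_i\), predictions \(\hat{y}_i\), and mean \(\bar{y}\):

\[
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert
\qquad
\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n} \frac{\lvert y_i - \hat{y}_i \rvert}{\lvert y_i \rvert}
\qquad
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}
\]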

Note: Above, we defined a function to evaluate these errors on any model. That way we can evaluate many models in the same way.

Step 6: Tune the model

  • Try different models and hyperparameters (e.g., a grid search; see the sketch below)
  • Try different feature engineering
  • Analyze model errors (more EDA)
  • Repeat steps 4-6 until model is good enough
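
A minimal sketch of a hyperparameter search, assuming concrete_train and feature_columns from Step 5 (the grid values here are illustrative):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]},
    scoring='neg_mean_absolute_error',  # optimize the metric you care about
    cv=5)
search.fit(
    X=concrete_train[feature_columns],
    y=concrete_train['strength'])
search.best_params_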

EDA on model errors for Concrete

concrete_test['error'] = concrete_test['strength'] - concrete_test['predicted_strength']
px.histogram(concrete_test, x='error', nbins=30)

Does the model do better or worse for certain ranges?

px.scatter(
    concrete_test,
    x='strength', y='error',
    trendline='ols',
    labels={
        'strength': 'Strength (MPa)',
        'error': 'Error (MPa)',
    },
)
concrete_test['strength_bin'] = pd.cut(concrete_test['strength'], bins=5)
concrete_test.groupby('strength_bin', as_index=False).apply(evaluate_df)
       strength_bin        MAE      MAPE       R^2
0   (4.495, 19.516]   9.000968  0.848740 -5.671546
1  (19.516, 34.462]   5.808164  0.213878 -2.287072
2  (34.462, 49.408]   8.305960  0.201023 -5.271035
3  (49.408, 64.354]   8.740660  0.154203 -4.768868
4    (64.354, 79.3]  11.762144  0.164826 -7.339047
concrete_test['age_bin'] = pd.cut(concrete_test['age'], bins=3)
concrete_test.groupby('age_bin', as_index=False).apply(evaluate_df)
              age_bin        MAE      MAPE       R^2
0    (2.638, 123.667]   7.760751  0.339492  0.652625
1  (123.667, 244.333]   6.864034  0.188294  0.568809
2    (244.333, 365.0]  11.220562  0.268660 -2.173470

Any relationships with other features?

px.scatter(
    concrete_test,
    x='water', y='error', trendline='ols',
)

We’re not seeing any clear overall patterns. Which mixtures are we worst at predicting? (ask a domain expert about these)

concrete_test.sort_values(by='error').head(5)
     cement   slag  fly_ash  water  superplasticizer  coarse_aggregate  fine_aggregate    age  strength  predicted_strength      error      strength_bin           age_bin
477   446.0   24.0     79.0  162.0              11.6             967.0           712.0    3.0     23.35           50.426023 -27.076023  (19.516, 34.462]  (2.638, 123.667]
85    379.5  151.2      0.0  153.9              15.9            1134.3           605.0    3.0     28.60           52.987434 -24.387434  (19.516, 34.462]  (2.638, 123.667]
622   307.0    0.0      0.0  193.0               0.0             968.0           812.0  365.0     36.15           59.850056 -23.700056  (34.462, 49.408]  (244.333, 365.0]
604   339.0    0.0      0.0  197.0               0.0             968.0           781.0  365.0     38.89           62.387515 -23.497515  (34.462, 49.408]  (244.333, 365.0]
294   168.9   42.2    124.3  158.3              10.8            1080.8           796.2    3.0      7.40           28.205935 -20.805935   (4.495, 19.516]  (2.638, 123.667]
concrete_test.sort_values(by='error').tail(5)
     cement   slag  fly_ash  water  superplasticizer  coarse_aggregate  fine_aggregate    age  strength  predicted_strength      error      strength_bin           age_bin
358   277.2   97.8     24.5  160.7              11.2            1061.7           782.5  100.0     66.95           48.291387  18.658613    (64.354, 79.3]  (2.638, 123.667]
828   522.0    0.0      0.0  146.0               0.0             896.0           896.0   28.0     74.99           53.777042  21.212958    (64.354, 79.3]  (2.638, 123.667]
356   277.2   97.8     24.5  160.7              11.2            1061.7           782.5   28.0     63.14           39.996732  23.143268  (49.408, 64.354]  (2.638, 123.667]
8     266.0  114.0      0.0  228.0               0.0             932.0           670.0   28.0     45.85           19.463945  26.386055  (34.462, 49.408]  (2.638, 123.667]
14    304.0   76.0      0.0  228.0               0.0             932.0           670.0   28.0     47.81           19.859987  27.950013  (34.462, 49.408]  (2.638, 123.667]

Concrete: try a different model

e.g., random forest

from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators=100, random_state=0)

Your turn: how well does a Random Forest model do on the test set?

Step 7: Deploy the model

Often called “inference” (as opposed to “training”)

  • Use the model to make predictions on new data as it arrives
  • Monitor the model’s performance over time
  • Retrain the model periodically
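
A common minimal pattern is to save the fitted model and reload it wherever predictions are served. A sketch using joblib (assuming a fitted model like the one from Step 5; the file name and new_batch are hypothetical):

import joblib

joblib.dump(model, 'concrete_model.joblib')   # at training time: persist the fitted model
model = joblib.load('concrete_model.joblib')  # at serving time: load it back
# new_batch would be a DataFrame of incoming mixtures with the same feature columns:
# predictions = model.predict(new_batch[feature_columns])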