flowchart LR
    A1[("Training Data (X and y)")] --> B{{fit}}
    A2[Model Object] --> B
    B --> FM[Fitted Model]
    FM --> C{{Predict}}
    B2[(New data X)] --> C
    C --> D[("predicted y's")]
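The diagram summarizes scikit-learn's estimator interface: construct a model object, `fit` it on training X and y, then `predict` y for new X. A minimal sketch with made-up toy data (the model choice here is arbitrary):

```python
from sklearn.neighbors import KNeighborsClassifier

# Tiny made-up training set: two features per row, one label per row.
X_train = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y_train = ["small", "small", "large", "large"]

model = KNeighborsClassifier(n_neighbors=1)      # model object (not yet fitted)
fitted = model.fit(X_train, y_train)             # fit: learn from the training data
print(fitted.predict([[1.5, 1.5], [8.5, 8.5]]))  # predict: labels for new X
```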
How is data used to make decisions that affect you or your neighbors?
Use whatever resources you like.
and so much more…
See, for example, *Weapons of Math Destruction*
Pick a partner (or two). Decide who’s trainer and who’s tester.
Trainer (needs a laptop):
Tester:
We’ll share descriptions.
`fit`, `predict`

flowchart LR
    A1[("Training Data (X and y)")] --> B{{fit}}
    A2[Model Object] --> B
    B --> FM[Fitted Model]
    FM --> C{{Predict}}
    B2[(New data X)] --> C
    C --> D[("predicted y's")]

import pandas as pd
penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv")
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   species            344 non-null    object 
 1   island             344 non-null    object 
 2   bill_length_mm     342 non-null    float64
 3   bill_depth_mm      342 non-null    float64
 4   flipper_length_mm  342 non-null    float64
 5   body_mass_g        342 non-null    float64
 6   sex                333 non-null    object 
 7   year               344 non-null    int64  
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB

from sklearn.neighbors import KNeighborsClassifier
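The next cell fits a 1-nearest-neighbor classifier on a `penguins_simple` frame that isn't constructed in this excerpt. Judging from the columns and row counts shown below, it is presumably something like this sketch:

```python
# Assumed construction of penguins_simple (not shown in this excerpt):
# keep the two bill measurements plus the species label, drop incomplete rows.
penguins_simple = penguins[['bill_depth_mm', 'bill_length_mm', 'species']].dropna().copy()
```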
model = KNeighborsClassifier(n_neighbors=1).fit(
    X=penguins_simple[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_simple['species'])
penguins_simple['predicted_species'] = model.predict(
    X=penguins_simple[['bill_depth_mm', 'bill_length_mm']])
penguins_simple.head()

| | bill_depth_mm | bill_length_mm | species | predicted_species |
|---|---|---|---|---|
| 0 | 18.7 | 39.1 | Adelie | Adelie | 
| 1 | 17.4 | 39.5 | Adelie | Adelie | 
| 2 | 18.0 | 40.3 | Adelie | Adelie | 
| 4 | 19.3 | 36.7 | Adelie | Adelie | 
| 5 | 20.6 | 39.3 | Adelie | Adelie | 
| | species | predicted_species | size |
|---|---|---|---|
| 0 | Adelie | Adelie | 151 | 
| 1 | Chinstrap | Chinstrap | 68 | 
| 2 | Gentoo | Gentoo | 123 | 
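These counts show every penguin predicted correctly. Equivalently, we could compute the accuracy in one call (a sketch using `sklearn.metrics.accuracy_score`):

```python
from sklearn.metrics import accuracy_score

# Fraction of rows where predicted_species matches species; here it is 1.0
# because a 1-nearest-neighbor model memorizes its own training data.
accuracy_score(penguins_simple['species'], penguins_simple['predicted_species'])
```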
Why should we be unsurprised that this works?
flowchart LR
    A1[("All training data")] --> B{{Split}}
    B --> A2[("Training data")]
    B --> A3[("Test data")]
    A2 --> C{{Fit}}
    A3 -->|NEVER!| C
    C --> D[("Fitted model")]
import numpy as np
np.random.seed(0)
penguins_simple["use_for_training"] = np.random.rand(len(penguins_simple)) < 0.5
penguins_train = penguins_simple.query("use_for_training").copy()
penguins_test = penguins_simple.query("not use_for_training").copy()    
print(f"{len(penguins_train)} training penguins, {len(penguins_test)} test penguins")171 training penguins, 171 test penguinsNote: Later we’ll learn about the train_test_split function, which does this for us.
penguins_train['predicted_species'] = model.predict(
    X=penguins_train[['bill_depth_mm', 'bill_length_mm']])
penguins_train.groupby(['species', 'predicted_species'], as_index=False).size()

| | species | predicted_species | size |
|---|---|---|---|
| 0 | Adelie | Adelie | 69 | 
| 1 | Chinstrap | Chinstrap | 39 | 
| 2 | Gentoo | Gentoo | 63 | 
penguins_test['predicted_species'] = model.predict(
    X=penguins_test[['bill_depth_mm', 'bill_length_mm']])
penguins_test.groupby(['species', 'predicted_species'], as_index=False).size()

| | species | predicted_species | size |
|---|---|---|---|
| 0 | Adelie | Adelie | 81 | 
| 1 | Adelie | Chinstrap | 1 | 
| 2 | Chinstrap | Adelie | 2 | 
| 3 | Chinstrap | Chinstrap | 26 | 
| 4 | Chinstrap | Gentoo | 1 | 
| 5 | Gentoo | Chinstrap | 6 | 
| 6 | Gentoo | Gentoo | 54 | 
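The same information can be summarized as a confusion matrix plus an overall accuracy (a sketch using `sklearn.metrics`):

```python
from sklearn.metrics import confusion_matrix, accuracy_score

# Rows are true species, columns are predicted species (labels in sorted order).
print(confusion_matrix(penguins_test['species'], penguins_test['predicted_species']))
print(accuracy_score(penguins_test['species'], penguins_test['predicted_species']))
```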
- k-Nearest Neighbors (`KNeighborsClassifier`)
- Decision Trees (`DecisionTreeClassifier`) and Random Forests (`RandomForestClassifier`)
- Logistic Regression (`LogisticRegression`, yes it's a classifier)
- Neural Networks (`MLPClassifier`)
- Gradient Boosting (`HistGradientBoostingClassifier`)

Decision tree:
from sklearn.tree import DecisionTreeClassifier, plot_tree
model = DecisionTreeClassifier(max_depth=3).fit(
        X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
        y=penguins_train['species'])
model.score(
    X=penguins_test[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_test['species'])
0.9122807017543859

array([[ 1.91703649, -0.89183601],
       [ 0.39297577,  0.28951405],
       [-2.31001227,  0.60232196]])

from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(10, 10)).fit(
    X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_train['species'])

/usr/local/Caskroom/miniconda/base/envs/fa22/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:702: ConvergenceWarning:
Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
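That warning is common when the features reach the network unscaled. One way to address it (an assumption, not something these notes necessarily do) is to standardize the features in a pipeline before the classifier:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Standardize each feature, then fit the network; scaling usually helps the
# stochastic optimizer converge within the default iteration limit.
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(10, 10)),
).fit(
    X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
    y=penguins_train['species'])
```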
- What does `fit` do?
- What does `predict` do?
- Why did we get an `accuracy_score` of 1.0? What did we do wrong?

We’ll focus on supervised learning and clustering in this course. Others:
From *Applied Predictive Modeling* (M. Kuhn and Johnson 2013), chapter 10.
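The `concrete_all` frame used below isn't loaded in this excerpt. The data is the UCI Concrete Compressive Strength dataset; a sketch of one way to load it (the exact URL is an assumption, and reading the `.xls` file requires the `xlrd` package):

```python
import pandas as pd

# Assumed source: the UCI Machine Learning Repository's concrete dataset.
concrete_all = pd.read_excel(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls")
```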
Original column names:
['Cement (component 1)(kg in a m^3 mixture)', 'Blast Furnace Slag (component 2)(kg in a m^3 mixture)', 'Fly Ash (component 3)(kg in a m^3 mixture)', 'Water  (component 4)(kg in a m^3 mixture)', 'Superplasticizer (component 5)(kg in a m^3 mixture)', 'Coarse Aggregate  (component 6)(kg in a m^3 mixture)', 'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)', 'Concrete compressive strength(MPa. megapascals)']

# clean up column names: remove everything after the first parenthesis, replace spaces with underscores
concrete_all.columns = [
    col.split('(', 1)[0].strip().replace(' ', '_').lower()
    for col in concrete_all.columns]
concrete_all = concrete_all.rename(columns={
    'concrete_compressive_strength': 'strength',
    'blast_furnace_slag': 'slag',
})
concrete_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   cement            1030 non-null   float64
 1   slag              1030 non-null   float64
 2   fly_ash           1030 non-null   float64
 3   water             1030 non-null   float64
 4   superplasticizer  1030 non-null   float64
 5   coarse_aggregate  1030 non-null   float64
 6   fine_aggregate    1030 non-null   float64
 7   age               1030 non-null   float64
 8   strength          1030 non-null   float64
dtypes: float64(9)
memory usage: 72.5 KB

| | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28.0 | 79.99 | 
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28.0 | 61.89 | 
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270.0 | 40.27 | 
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365.0 | 41.05 | 
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360.0 | 44.30 | 

Concrete example:
# Do we have duplicates of any mixtures that differ only in age and strength?
other_columns = [col for col in concrete_all.columns if col not in ['age', 'strength']]
concrete_all.groupby(other_columns, as_index=False).size().sort_values(by='size', ascending=False).head(10)

| | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | size |
|---|---|---|---|---|---|---|---|---|
| 356 | 362.6 | 189.0 | 0.0 | 164.9 | 11.6 | 944.7 | 755.8 | 20 | 
| 393 | 425.0 | 106.3 | 0.0 | 153.5 | 16.5 | 852.1 | 887.1 | 15 | 
| 398 | 446.0 | 24.0 | 79.0 | 162.0 | 11.6 | 967.0 | 712.0 | 12 | 
| 413 | 500.0 | 0.0 | 0.0 | 200.0 | 0.0 | 1125.0 | 613.0 | 8 | 
| 355 | 359.0 | 19.0 | 141.0 | 154.0 | 10.9 | 942.0 | 801.0 | 8 | 
| 426 | 540.0 | 0.0 | 0.0 | 173.0 | 0.0 | 1125.0 | 613.0 | 7 | 
| 421 | 525.0 | 0.0 | 0.0 | 189.0 | 0.0 | 1125.0 | 613.0 | 7 | 
| 342 | 339.0 | 0.0 | 0.0 | 197.0 | 0.0 | 968.0 | 781.0 | 7 | 
| 367 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 6 | 
| 147 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 6 | 
We’ll need to address this later.
Are they all the same total weight?
So we can evaluate the model.

from sklearn.model_selection import train_test_split
concrete_train, concrete_test = train_test_split(concrete_all, random_state=0, test_size=0.2)
print(f"{len(concrete_train)} training mixtures, {len(concrete_test)} test mixtures")824 training mixtures, 206 test mixturesCareful: we just duplicated some mixtures in both sets (measurements at different ages).
Sometimes the features are already in a useful form, but sometimes we can get more useful features by transforming or combining the existing features.
Examples:
- transforming `age` (e.g., taking its logarithm, since the influence of one extra day decreases as age increases; see the sketch below)

We’ll skip this for concrete for now.
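If we did want to try it, it might look like this (the `log_age` column name is hypothetical):

```python
import numpy as np

# Hypothetical derived feature: the log of age, so one extra day matters
# less for old samples than for young ones.
concrete_train['log_age'] = np.log(concrete_train['age'])
concrete_test['log_age'] = np.log(concrete_test['age'])
```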
Some models may require special preparation of the data:
from sklearn.linear_model import LinearRegression
feature_columns = [col for col in concrete_train.columns if col != 'strength']
model = LinearRegression().fit(
    X=concrete_train[feature_columns],
    y=concrete_train['strength'])
concrete_test['predicted_strength'] = model.predict(
    X=concrete_test[feature_columns])

from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score
def evaluate(y_true, y_pred):
    return pd.Series({
        'MAE': mean_absolute_error(y_true, y_pred),
        'MAPE': mean_absolute_percentage_error(y_true, y_pred),
        'R^2': r2_score(y_true, y_pred),
    })
def evaluate_df(df):
    return evaluate(df['strength'], df['predicted_strength'])
evaluate_df(concrete_test).to_frame('test')

| | test |
|---|---|
| MAE | 7.864642 | 
| MAPE | 0.331604 | 
| R^2 | 0.636961 | 
Note: Above, we defined a function to evaluate these errors on any model. That way we can evaluate many models in the same way.
import plotly.express as px

concrete_test['error'] = concrete_test['strength'] - concrete_test['predicted_strength']
px.histogram(concrete_test, x='error', nbins=30)

Does the model do better or worse for certain ranges?
px.scatter(
    concrete_test,
    x='strength', y='error',
    trendline='ols',
    labels={
        'strength': 'Strength (MPa)',
        'error': 'Error (MPa)',
    },
)

concrete_test['strength_bin'] = pd.cut(concrete_test['strength'], bins=5)
concrete_test.groupby('strength_bin', as_index=False).apply(evaluate_df)

| | strength_bin | MAE | MAPE | R^2 |
|---|---|---|---|---|
| 0 | (4.495, 19.516] | 9.000968 | 0.848740 | -5.671546 | 
| 1 | (19.516, 34.462] | 5.808164 | 0.213878 | -2.287072 | 
| 2 | (34.462, 49.408] | 8.305960 | 0.201023 | -5.271035 | 
| 3 | (49.408, 64.354] | 8.740660 | 0.154203 | -4.768868 | 
| 4 | (64.354, 79.3] | 11.762144 | 0.164826 | -7.339047 | 
concrete_test['age_bin'] = pd.cut(concrete_test['age'], bins=3)
concrete_test.groupby('age_bin', as_index=False).apply(evaluate_df)

| | age_bin | MAE | MAPE | R^2 |
|---|---|---|---|---|
| 0 | (2.638, 123.667] | 7.760751 | 0.339492 | 0.652625 | 
| 1 | (123.667, 244.333] | 6.864034 | 0.188294 | 0.568809 | 
| 2 | (244.333, 365.0] | 11.220562 | 0.268660 | -2.173470 | 
Any relationships with other features?
We’re not seeing any clear overall patterns. Which mixtures are we worst at predicting? (ask a domain expert about these)
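The two tables below were presumably produced by sorting the test set on the error column; a sketch:

```python
# Most negative errors: mixtures whose strength the model most over-predicts.
print(concrete_test.sort_values('error').head())

# Most positive errors: mixtures whose strength the model most under-predicts.
print(concrete_test.sort_values('error').tail())
```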
| | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength | predicted_strength | error | strength_bin | age_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 477 | 446.0 | 24.0 | 79.0 | 162.0 | 11.6 | 967.0 | 712.0 | 3.0 | 23.35 | 50.426023 | -27.076023 | (19.516, 34.462] | (2.638, 123.667] | 
| 85 | 379.5 | 151.2 | 0.0 | 153.9 | 15.9 | 1134.3 | 605.0 | 3.0 | 28.60 | 52.987434 | -24.387434 | (19.516, 34.462] | (2.638, 123.667] | 
| 622 | 307.0 | 0.0 | 0.0 | 193.0 | 0.0 | 968.0 | 812.0 | 365.0 | 36.15 | 59.850056 | -23.700056 | (34.462, 49.408] | (244.333, 365.0] | 
| 604 | 339.0 | 0.0 | 0.0 | 197.0 | 0.0 | 968.0 | 781.0 | 365.0 | 38.89 | 62.387515 | -23.497515 | (34.462, 49.408] | (244.333, 365.0] | 
| 294 | 168.9 | 42.2 | 124.3 | 158.3 | 10.8 | 1080.8 | 796.2 | 3.0 | 7.40 | 28.205935 | -20.805935 | (4.495, 19.516] | (2.638, 123.667] | 
| | cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength | predicted_strength | error | strength_bin | age_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 358 | 277.2 | 97.8 | 24.5 | 160.7 | 11.2 | 1061.7 | 782.5 | 100.0 | 66.95 | 48.291387 | 18.658613 | (64.354, 79.3] | (2.638, 123.667] | 
| 828 | 522.0 | 0.0 | 0.0 | 146.0 | 0.0 | 896.0 | 896.0 | 28.0 | 74.99 | 53.777042 | 21.212958 | (64.354, 79.3] | (2.638, 123.667] | 
| 356 | 277.2 | 97.8 | 24.5 | 160.7 | 11.2 | 1061.7 | 782.5 | 28.0 | 63.14 | 39.996732 | 23.143268 | (49.408, 64.354] | (2.638, 123.667] | 
| 8 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28.0 | 45.85 | 19.463945 | 26.386055 | (34.462, 49.408] | (2.638, 123.667] | 
| 14 | 304.0 | 76.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28.0 | 47.81 | 19.859987 | 27.950013 | (34.462, 49.408] | (2.638, 123.667] | 
e.g., random forest
Your turn: how well does a Random Forest model do on the test set?
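A sketch of the setup (the score is left for you to find):

```python
from sklearn.ensemble import RandomForestRegressor

# Fit on the same features and target as the linear model, then evaluate
# on the test set with the evaluate() helper defined earlier.
rf_model = RandomForestRegressor(random_state=0).fit(
    X=concrete_train[feature_columns],
    y=concrete_train['strength'])
evaluate(concrete_test['strength'], rf_model.predict(concrete_test[feature_columns]))
```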
Often called “inference” (as opposed to “training”)