flowchart LR A1[("Training Data (X and y)")] --> B{{fit}} A2[Model Object] --> B B --> FM[Fitted Model] FM --> C{{Predict}} B2[(New data X)] --> C C --> D[("predicted y's")]
How is data used to make decisions that affect you or your neighbors?
Use whatever resources you like.
and so much more…
See, for example, Weapons of Math Destruction
Pick a partner (or two). Decide who’s trainer and who’s tester.
Trainer (needs a laptop):
Tester:
We’ll share descriptions.
fit
, predict
flowchart LR A1[("Training Data (X and y)")] --> B{{fit}} A2[Model Object] --> B B --> FM[Fitted Model] FM --> C{{Predict}} B2[(New data X)] --> C C --> D[("predicted y's")]
import pandas as pd
penguins = pd.read_csv("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/main/inst/extdata/penguins.csv")
penguins.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 344 entries, 0 to 343
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 species 344 non-null object
1 island 344 non-null object
2 bill_length_mm 342 non-null float64
3 bill_depth_mm 342 non-null float64
4 flipper_length_mm 342 non-null float64
5 body_mass_g 342 non-null float64
6 sex 333 non-null object
7 year 344 non-null int64
dtypes: float64(4), int64(1), object(3)
memory usage: 21.6+ KB
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(n_neighbors=1).fit(
X=penguins_simple[['bill_depth_mm', 'bill_length_mm']],
y=penguins_simple['species'])
penguins_simple['predicted_species'] = model.predict(
X=penguins_simple[['bill_depth_mm', 'bill_length_mm']])
penguins_simple.head()
bill_depth_mm | bill_length_mm | species | predicted_species | |
---|---|---|---|---|
0 | 18.7 | 39.1 | Adelie | Adelie |
1 | 17.4 | 39.5 | Adelie | Adelie |
2 | 18.0 | 40.3 | Adelie | Adelie |
4 | 19.3 | 36.7 | Adelie | Adelie |
5 | 20.6 | 39.3 | Adelie | Adelie |
species | predicted_species | size | |
---|---|---|---|
0 | Adelie | Adelie | 151 |
1 | Chinstrap | Chinstrap | 68 |
2 | Gentoo | Gentoo | 123 |
Why should we be unsurprised that this works?
flowchart LR A1[("All training data")] --> B{{Split}} B --> A2[("Training data")] B --> A3[("Test data")] A2 --> C{{Fit}} A3 -->|NEVER!| C C --> D[("Fitted model")]
import numpy as np
np.random.seed(0)
penguins_simple["use_for_training"] = np.random.rand(len(penguins_simple)) < 0.5
penguins_train = penguins_simple.query("use_for_training").copy()
penguins_test = penguins_simple.query("not use_for_training").copy()
print(f"{len(penguins_train)} training penguins, {len(penguins_test)} test penguins")
171 training penguins, 171 test penguins
Note: Later we’ll learn about the train_test_split
function, which does this for us.
penguins_train['predicted_species'] = model.predict(
X=penguins_train[['bill_depth_mm', 'bill_length_mm']])
penguins_train.groupby(['species', 'predicted_species'], as_index=False).size()
species | predicted_species | size | |
---|---|---|---|
0 | Adelie | Adelie | 69 |
1 | Chinstrap | Chinstrap | 39 |
2 | Gentoo | Gentoo | 63 |
penguins_test['predicted_species'] = model.predict(
X=penguins_test[['bill_depth_mm', 'bill_length_mm']])
penguins_test.groupby(['species', 'predicted_species'], as_index=False).size()
species | predicted_species | size | |
---|---|---|---|
0 | Adelie | Adelie | 81 |
1 | Adelie | Chinstrap | 1 |
2 | Chinstrap | Adelie | 2 |
3 | Chinstrap | Chinstrap | 26 |
4 | Chinstrap | Gentoo | 1 |
5 | Gentoo | Chinstrap | 6 |
6 | Gentoo | Gentoo | 54 |
KNeighborsClassifier
)DecisionTreeClassifier
) and Random Forests (RandomForestClassifier
)LogisticRegression
, yes it’s a classifier)MLPClassifier
)HistGradientBoostingClassifier
)Decision tree:
from sklearn.tree import DecisionTreeClassifier, plot_tree
model = DecisionTreeClassifier(max_depth=3).fit(
X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
y=penguins_train['species'])
model.score(
X=penguins_test[['bill_depth_mm', 'bill_length_mm']],
y=penguins_test['species'])
0.9122807017543859
array([[ 1.91703649, -0.89183601],
[ 0.39297577, 0.28951405],
[-2.31001227, 0.60232196]])
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(10, 10)).fit(
X=penguins_train[['bill_depth_mm', 'bill_length_mm']],
y=penguins_train['species'])
/usr/local/Caskroom/miniconda/base/envs/fa22/lib/python3.10/site-packages/sklearn/neural_network/_multilayer_perceptron.py:702: ConvergenceWarning:
Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
fit
do?predict
do?accuracy_score
of 1.0? What did we do wrong?We’ll focus on supervised learning and clustering in this course. Others:
From Applied Predictive Modeling (M. Kuhn and Johnson 2013), chapter 10.
Original column names:
['Cement (component 1)(kg in a m^3 mixture)', 'Blast Furnace Slag (component 2)(kg in a m^3 mixture)', 'Fly Ash (component 3)(kg in a m^3 mixture)', 'Water (component 4)(kg in a m^3 mixture)', 'Superplasticizer (component 5)(kg in a m^3 mixture)', 'Coarse Aggregate (component 6)(kg in a m^3 mixture)', 'Fine Aggregate (component 7)(kg in a m^3 mixture)', 'Age (day)', 'Concrete compressive strength(MPa. megapascals)']
# clean up column names: remove everything after the first parenthesis, replace spaces with underscores
concrete_all.columns = [
col.split('(', 1)[0].strip().replace(' ', '_').lower()
for col in concrete_all.columns]
concrete_all = concrete_all.rename(columns={
'concrete_compressive_strength': 'strength',
'blast_furnace_slag': 'slag',
})
concrete_all.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cement 1030 non-null float64
1 slag 1030 non-null float64
2 fly_ash 1030 non-null float64
3 water 1030 non-null float64
4 superplasticizer 1030 non-null float64
5 coarse_aggregate 1030 non-null float64
6 fine_aggregate 1030 non-null float64
7 age 1030 non-null float64
8 strength 1030 non-null float64
dtypes: float64(9)
memory usage: 72.5 KB
cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength | |
---|---|---|---|---|---|---|---|---|---|
0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28.0 | 79.99 |
1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28.0 | 61.89 |
2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270.0 | 40.27 |
3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365.0 | 41.05 |
4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360.0 | 44.30 |
Concrete example:
Concrete example:
# Do we have duplicates of any mixtures that differ only in age and strength?
other_columns = [col for col in concrete_all.columns if col not in ['age', 'strength']]
concrete_all.groupby(other_columns, as_index=False).size().sort_values(by='size', ascending=False).head(10)
cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | size | |
---|---|---|---|---|---|---|---|---|
356 | 362.6 | 189.0 | 0.0 | 164.9 | 11.6 | 944.7 | 755.8 | 20 |
393 | 425.0 | 106.3 | 0.0 | 153.5 | 16.5 | 852.1 | 887.1 | 15 |
398 | 446.0 | 24.0 | 79.0 | 162.0 | 11.6 | 967.0 | 712.0 | 12 |
413 | 500.0 | 0.0 | 0.0 | 200.0 | 0.0 | 1125.0 | 613.0 | 8 |
355 | 359.0 | 19.0 | 141.0 | 154.0 | 10.9 | 942.0 | 801.0 | 8 |
426 | 540.0 | 0.0 | 0.0 | 173.0 | 0.0 | 1125.0 | 613.0 | 7 |
421 | 525.0 | 0.0 | 0.0 | 189.0 | 0.0 | 1125.0 | 613.0 | 7 |
342 | 339.0 | 0.0 | 0.0 | 197.0 | 0.0 | 968.0 | 781.0 | 7 |
367 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 6 |
147 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 6 |
We’ll need to address this later.
Are they all the same total weight?
So we can evaluate the model.
from sklearn.model_selection import train_test_split
concrete_train, concrete_test = train_test_split(concrete_all, random_state=0, test_size=0.2)
print(f"{len(concrete_train)} training mixtures, {len(concrete_test)} test mixtures")
824 training mixtures, 206 test mixtures
Careful: we just duplicated some mixtures in both sets (measurements at different ages).
Sometimes the features are already in a useful form, but sometimes we can get more useful features by transforming or combining the existing features.
Examples:
age
(influence of a day decreases as age increases)We’ll skip this for concrete for now.
Some models may require special preparation of the data:
from sklearn.linear_model import LinearRegression
feature_columns = [col for col in concrete_train.columns if col != 'strength']
model = LinearRegression().fit(
X=concrete_train[feature_columns],
y=concrete_train['strength'])
concrete_test['predicted_strength'] = model.predict(
X=concrete_test[feature_columns])
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error, r2_score
def evaluate(y_true, y_pred):
return pd.Series({
'MAE': mean_absolute_error(y_true, y_pred),
'MAPE': mean_absolute_percentage_error(y_true, y_pred),
'R^2': r2_score(y_true, y_pred),
})
def evaluate_df(df):
return evaluate(df['strength'], df['predicted_strength'])
evaluate_df(concrete_test).to_frame('test')
test | |
---|---|
MAE | 7.864642 |
MAPE | 0.331604 |
R^2 | 0.636961 |
Note: Above, we defined a function to evaluate these errors on any model. That way we can evaluate many models in the same way.
concrete_test['error'] = concrete_test['strength'] - concrete_test['predicted_strength']
px.histogram(concrete_test, x='error', nbins=30)
Does the model do better or worse for certain ranges?
px.scatter(
concrete_test,
x='strength', y='error',
trendline='ols',
labels={
'strength': 'Strength (MPa)',
'error': 'Error (MPa)',
},
)
concrete_test['strength_bin'] = pd.cut(concrete_test['strength'], bins=5)
concrete_test.groupby('strength_bin', as_index=False).apply(evaluate_df)
strength_bin | MAE | MAPE | R^2 | |
---|---|---|---|---|
0 | (4.495, 19.516] | 9.000968 | 0.848740 | -5.671546 |
1 | (19.516, 34.462] | 5.808164 | 0.213878 | -2.287072 |
2 | (34.462, 49.408] | 8.305960 | 0.201023 | -5.271035 |
3 | (49.408, 64.354] | 8.740660 | 0.154203 | -4.768868 |
4 | (64.354, 79.3] | 11.762144 | 0.164826 | -7.339047 |
concrete_test['age_bin'] = pd.cut(concrete_test['age'], bins=3)
concrete_test.groupby('age_bin', as_index=False).apply(evaluate_df)
age_bin | MAE | MAPE | R^2 | |
---|---|---|---|---|
0 | (2.638, 123.667] | 7.760751 | 0.339492 | 0.652625 |
1 | (123.667, 244.333] | 6.864034 | 0.188294 | 0.568809 |
2 | (244.333, 365.0] | 11.220562 | 0.268660 | -2.173470 |
Any relationships with other features?
We’re not seeing any clear overall patterns. Which mixtures are we worst at predicting? (ask a domain expert about these)
cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength | predicted_strength | error | strength_bin | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
477 | 446.0 | 24.0 | 79.0 | 162.0 | 11.6 | 967.0 | 712.0 | 3.0 | 23.35 | 50.426023 | -27.076023 | (19.516, 34.462] | (2.638, 123.667] |
85 | 379.5 | 151.2 | 0.0 | 153.9 | 15.9 | 1134.3 | 605.0 | 3.0 | 28.60 | 52.987434 | -24.387434 | (19.516, 34.462] | (2.638, 123.667] |
622 | 307.0 | 0.0 | 0.0 | 193.0 | 0.0 | 968.0 | 812.0 | 365.0 | 36.15 | 59.850056 | -23.700056 | (34.462, 49.408] | (244.333, 365.0] |
604 | 339.0 | 0.0 | 0.0 | 197.0 | 0.0 | 968.0 | 781.0 | 365.0 | 38.89 | 62.387515 | -23.497515 | (34.462, 49.408] | (244.333, 365.0] |
294 | 168.9 | 42.2 | 124.3 | 158.3 | 10.8 | 1080.8 | 796.2 | 3.0 | 7.40 | 28.205935 | -20.805935 | (4.495, 19.516] | (2.638, 123.667] |
cement | slag | fly_ash | water | superplasticizer | coarse_aggregate | fine_aggregate | age | strength | predicted_strength | error | strength_bin | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
358 | 277.2 | 97.8 | 24.5 | 160.7 | 11.2 | 1061.7 | 782.5 | 100.0 | 66.95 | 48.291387 | 18.658613 | (64.354, 79.3] | (2.638, 123.667] |
828 | 522.0 | 0.0 | 0.0 | 146.0 | 0.0 | 896.0 | 896.0 | 28.0 | 74.99 | 53.777042 | 21.212958 | (64.354, 79.3] | (2.638, 123.667] |
356 | 277.2 | 97.8 | 24.5 | 160.7 | 11.2 | 1061.7 | 782.5 | 28.0 | 63.14 | 39.996732 | 23.143268 | (49.408, 64.354] | (2.638, 123.667] |
8 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28.0 | 45.85 | 19.463945 | 26.386055 | (34.462, 49.408] | (2.638, 123.667] |
14 | 304.0 | 76.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28.0 | 47.81 | 19.859987 | 27.950013 | (34.462, 49.408] | (2.638, 123.667] |
e.g., random forest
Your turn: how well does a Random Forest model do on the test set?
Often called “inference” (as opposed to “training”)