Exercise 12: Logistic Regression, Regularization, and Cross-Validation

In this exercise, we’ll practice classification using a logistic regression model. Since this model is prone to overfitting, we’ll learn how to tame its excesses with regularization, and how to estimate how well it will generalize to new data using cross-validation.

Completing this exercise will help you be able to:

  • Fit and evaluate a logistic regression classifier in scikit-learn.
  • Use regularization to reduce overfitting, and describe how the regularization parameter C affects what the model learns.
  • Use cross-validation to choose hyperparameters and to estimate how well a model will generalize to new data.

Along the way, we’ll also see a technique for handling missing data, and get more practice with how to build pipelines in scikit-learn.

Code will be provided for this exercise. You should retype the code on your own and try to understand what each part does. There will also be some conceptual questions for you to answer.

Getting Started

We’ll use a dataset of who survived the Titanic shipwreck. To make the point of this exercise clearer, we’ll also augment the dataset with some irrelevant (noise) features. We’ll keep the original goal, though: to predict whether a passenger survived or not.1

This task is a classic in machine learning, but I’ve never seen anyone mention how fake this prediction task is! There is no real-world scenario where you get to see whether some passengers survived and some didn’t, and then you have to predict whether other passengers survived or not. When you’re doing projects, please think about whether the task you’re trying to solve is (a) a task that anyone would actually care about and (b) a task that’s doable using data that you’d actually have access to.

Here are the imports you’ll need.

```{python}
import pandas as pd
import numpy as np

from sklearn.compose import make_column_transformer
from sklearn.datasets import fetch_openml, load_breast_cancer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score

import plotly.express as px
```

We’ll load the data from a site called OpenML. OpenML is a repository of datasets that are useful for machine learning. Specifically, we’ll load the Titanic dataset. This dataset contains information about passengers on the Titanic, including whether they survived or not. The Kaggle page for this dataset has more information about the features.

```{python}
titanic_all = fetch_openml(
    data_id=40945, as_frame=True, parser="pandas"
)
titanic = titanic_all.frame
```

Now we’ll add some noise features. Just copy-paste this code; don’t worry about how it works.

```{python}
rng = np.random.RandomState(0)
n_samples = titanic.shape[0]
n_features = 500
noise_feature_names = [f"noise_{i}" for i in range(n_features)]
noise_features = pd.DataFrame(
    rng.normal(size=(n_samples, n_features)),
    columns=noise_feature_names
)
titanic = pd.concat([titanic, noise_features], axis=1)
titanic.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Columns: 514 entries, pclass to noise_499
dtypes: category(3), float64(503), int64(3), object(5)
memory usage: 5.1+ MB

We’ll use two numeric features (in addition to the noise features):

  • age: the age of the passenger
  • fare: the fare paid by the passenger

And three categorical features:

  • embarked: the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
  • sex
  • pclass: the passenger class (1 = first class, 2 = second class, 3 = third class)

Note that pclass is encoded as a number, but it’s really a categorical feature. We’ll treat it as categorical.

```{python}
# Select the features and target
target_column = "survived"
numeric_features = ["age", "fare"] + noise_feature_names
categorical_features = ["embarked", "sex", "pclass"]
feature_columns = numeric_features + categorical_features
```

Let’s hold out a test set. We’ll use a large test set so that our estimates of generalization error have less variance.

```{python}
titanic_train, titanic_test = train_test_split(
    titanic, test_size=.5, random_state=0
)
print(f"Training set size: {titanic_train.shape[0]}")
print(f"Test set size: {titanic_test.shape[0]}")
```
Training set size: 654
Test set size: 655

The numeric features will need some preprocessing. We’ll first scale them to be zero mean and unit variance, and then impute missing values. Note: at this step we’re just defining what processing will happen; the actual processing will be done later when we fit the pipeline.

```{python}
numeric_transformer = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="median")
)
```

Pause to think:

  • What is strategy="median" doing? What alternative strategies could we use for imputing missing values? Do you think the median would be a good guess for the missing values?
  • Why do we need to impute missing values after scaling the features? What would happen if we did it the other way around?
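
If you want to experiment with the first question, here’s a tiny sketch on made-up data (a toy column, not one of the Titanic features):

```{python}
# Toy illustration: how each SimpleImputer strategy fills in the NaN.
# Note how the outlier (100.0) pulls the mean but not the median.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 100.0, np.nan]})
for strategy in ['mean', 'median', 'most_frequent']:
    filled = SimpleImputer(strategy=strategy).fit_transform(toy)
    print(strategy, '->', filled[-1, 0])
```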

The categorical features need to be one-hot-encoded.

```{python}
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False)
)
```

Pause to think: How many columns will the categorical transformer produce? What will each column represent? Write a brief explanation.
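
To check your answer empirically, here’s a toy example (the color column is made up, unrelated to the Titanic data):

```{python}
# Toy illustration: OneHotEncoder makes one output column per distinct value
enc = OneHotEncoder(sparse_output=False).set_output(transform='pandas')
enc.fit_transform(pd.DataFrame({'color': ['red', 'blue', 'red', 'green']}))
```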

We can combine the numeric and categorical transformers into a single transformer that will apply each one to the appropriate columns.

```{python}
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features)
).set_output(transform='pandas')
```

Let’s show an example of what this preprocessor will do to our data. Check your results from above.

```{python}
example_preprocessed_data = preprocessor.fit_transform(titanic_train[feature_columns])
example_preprocessed_data.info()
example_preprocessed_data.head()
```
<class 'pandas.core.frame.DataFrame'>
Index: 654 entries, 293 to 684
Columns: 510 entries, pipeline-1__age to pipeline-2__pclass_3
dtypes: float64(510)
memory usage: 2.5 MB

      pipeline-1__age  pipeline-1__fare  pipeline-1__noise_0  ...  pipeline-2__embarked_C  pipeline-2__embarked_Q  pipeline-2__embarked_S  pipeline-2__sex_female  pipeline-2__sex_male  pipeline-2__pclass_1  pipeline-2__pclass_2  pipeline-2__pclass_3
293         -0.140868          0.405374             0.534576  ...                     0.0                     0.0                     1.0                     1.0                   0.0                   1.0                   0.0                   0.0
76           0.649024          1.070188            -0.250568  ...                     1.0                     0.0                     0.0                     1.0                   0.0                   1.0                   0.0                   0.0
48           1.654342         -0.118530            -0.283955  ...                     1.0                     0.0                     0.0                     1.0                   0.0                   1.0                   0.0                   0.0
689         -0.571718         -0.549441            -0.739937  ...                     0.0                     0.0                     1.0                     0.0                   1.0                   0.0                   0.0                   1.0
1195        -0.140868         -0.538772             0.097809  ...                     0.0                     1.0                     0.0                     0.0                   1.0                   0.0                   0.0                   1.0

5 rows × 510 columns

Notice that there are no missing values and that there are separate columns for each of the possible values of the categorical variables.

Now we can put all of those elements together to train a basic logistic regression model.

```{python}
model = make_pipeline(
    preprocessor,
    LogisticRegression(penalty=None)
)
```

Your turn: fit and evaluate this model (using model.fit(...)). What is the accuracy on the training set? What is the accuracy on the test set?

Accuracy on training set: 1.000
Accuracy on test set: 0.652
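
For reference, here’s one way to produce those numbers (a sketch; the same fit-and-score pattern appears again below):

```{python}
model.fit(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)
print("Accuracy on training set: {:.3f}".format(model.score(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)))
print("Accuracy on test set: {:.3f}".format(model.score(
    X=titanic_test[feature_columns],
    y=titanic_test[target_column]
)))
```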

What do you notice about the accuracy of the model on the training set vs the test set?

Regularization

The model was able to get a perfect score on the training set. But to do that, it used some of the noise features, which clearly will not generalize to new data. We can see this by looking at the coefficients of the model:

```{python}
def get_coefs_df(model, feature_names):
    logreg = model.named_steps['logisticregression']
    assert logreg.coef_.shape[0] == 1
    assert logreg.coef_.shape[1] == len(feature_names)
    coefs_df = pd.DataFrame(dict(
        name=feature_names,
        coef=logreg.coef_[0]
    ))
    coefs_df['abs_coef'] = coefs_df['coef'].abs()
    return coefs_df.sort_values('abs_coef', ascending=False)

coefs_df = get_coefs_df(model, example_preprocessed_data.columns)
coefs_df.head(10)
```
                        name       coef   abs_coef
506     pipeline-2__sex_male -12.828261  12.828261
505   pipeline-2__sex_female  11.729937  11.729937
509     pipeline-2__pclass_3  -7.393828   7.393828
1           pipeline-1__fare   5.932705   5.932705
381    pipeline-1__noise_379  -5.126387   5.126387
507     pipeline-2__pclass_1   4.860676   4.860676
473    pipeline-1__noise_471  -4.829957   4.829957
104    pipeline-1__noise_102   4.396529   4.396529
193    pipeline-1__noise_191  -4.325752   4.325752
484    pipeline-1__noise_482  -4.149559   4.149559

```{python}
coefs_df['is_noise'] = coefs_df['name'].str.contains('noise')
print("Total noise coefficients: {:.2f}".format(
    coefs_df.query('is_noise')['coef'].abs().sum()
))
```
Total noise coefficients: 617.16

If we could get the model to put less weight on the noise features, it might generalize better. But in a real example, we don’t know which features are noise. And some features may be partly noise, partly signal. So we can’t just remove the noise features.

Instead, we can use regularization to encourage the model to put less weight on any individual feature. The hope is that this will encourage the model to put more weight on the features that are signal, and less weight on the features that are noise.
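
Concretely, with penalty='l1', scikit-learn fits the coefficients $w$ (and intercept $c$) by minimizing an $\ell_1$-penalized log-loss; per the scikit-learn docs (with labels coded as $y_i \in \{-1, +1\}$):

$$\min_{w,\,c}\;\; \|w\|_1 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\left(-y_i (x_i^\top w + c)\right)\right)$$

Note that C multiplies the data-fit term rather than the penalty, which is worth keeping in mind for the questions below.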

```{python}
model = make_pipeline(
    preprocessor,
    LogisticRegression(C=1e-1, penalty='l1', solver='liblinear')
).fit(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)
print("Accuracy on training set:", model.score(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
))
print("Accuracy on test set:", model.score(
    X=titanic_test[feature_columns],
    y=titanic_test[target_column]
))
```
Accuracy on training set: 0.8669724770642202
Accuracy on test set: 0.76793893129771
```{python}
coefs_df = get_coefs_df(model, example_preprocessed_data.columns)
coefs_df.head(10)
```
                        name      coef  abs_coef
506     pipeline-2__sex_male -1.213080  1.213080
505   pipeline-2__sex_female  0.934112  0.934112
509     pipeline-2__pclass_3 -0.544729  0.544729
507     pipeline-2__pclass_1  0.422453  0.422453
473    pipeline-1__noise_471 -0.216718  0.216718
193    pipeline-1__noise_191 -0.190734  0.190734
381    pipeline-1__noise_379 -0.167910  0.167910
389    pipeline-1__noise_387 -0.161352  0.161352
484    pipeline-1__noise_482 -0.152020  0.152020
296    pipeline-1__noise_294 -0.138615  0.138615

```{python}
coefs_df['is_noise'] = coefs_df['name'].str.contains('noise')
print("Total weight on noise coefficients: {:.2f}".format(
    coefs_df.query('is_noise')['coef'].abs().sum()
))
```
Total weight on noise coefficients: 5.27

Your turn: what do you notice about the model’s accuracy scores now? What do you notice about the coefficients?

Your turn: Try adjusting the value of C in the model. Answer:

  • How does the training accuracy change as C changes?
  • What value of C gives the best training accuracy?
  • How does the test accuracy change as C changes?
  • What value of C gives the best test accuracy?
  • How does the amount of weight that the model puts on the noise features change as C changes?
  • Does a high value of C seem to correspond to regularizing a lot or a little?

This plot might help you answer some of those questions:

```{python}
# Make a plot of train and test accuracy vs C
C_values_to_try = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
train_accuracies = []
test_accuracies = []
for C in C_values_to_try:
    model = make_pipeline(
        preprocessor,
        LogisticRegression(C=C, penalty='l1', solver='liblinear')
    ).fit(
        X=titanic_train[feature_columns],
        y=titanic_train[target_column]
    )
    train_accuracies.append(model.score(
        X=titanic_train[feature_columns],
        y=titanic_train[target_column]
    ))
    test_accuracies.append(model.score(
        X=titanic_test[feature_columns],
        y=titanic_test[target_column]
    ))
accuracies_df = pd.DataFrame(dict(
    C=C_values_to_try,
    train_accuracy=train_accuracies,
    test_accuracy=test_accuracies
)).melt(id_vars='C', var_name='dataset', value_name='accuracy')
px.line(accuracies_df, x='C', y='accuracy', color='dataset', log_x=True)
```

Cross-Validation

We’ve seen that regularization can help prevent overfitting. But how do we know how much regularization to use? We could try different values of the regularization parameter C and see which one gives the best accuracy on the test set. But then we’d be using the test set to choose the model, and a test set that helped choose the model is no longer a reliable estimate of generalization error.

Instead, we can use cross-validation to estimate how well the model will generalize to new data. The idea is to hold out one of the data points, train the model on the rest, and then see how well it predicts the held-out data point. We can repeat this process for each data point, and then average the scores to get an estimate of how well the model will generalize.

Buuut… that requires training the model as many times as there are data points. That’s a lot of training! (I tried it and got tired of waiting.) So, instead of holding out one data point at a time, we can hold out a bunch of points at a time. These groups of points are called folds. We then test the model on each fold, using the other folds as the training set. We can then average the scores on each fold to get an estimate of how well the model will generalize.

Your turn: If we have 100 data points and we use 10-fold cross-validation, how many data points will be in each fold? How many times will we train a model? How big will the training set be each time? How big will the test set be each time?
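
Here’s a sketch of what cross-validation looks like in code for a single fixed value of C, using the cross_val_score helper imported earlier:

```{python}
# 5-fold cross-validation for one value of C. cross_val_score refits a
# fresh clone of the pipeline on each fold's training portion.
cv_model = make_pipeline(
    preprocessor,
    LogisticRegression(C=1e-1, penalty='l1', solver='liblinear')
)
scores = cross_val_score(
    cv_model,
    X=titanic_train[feature_columns],
    y=titanic_train[target_column],
    cv=5  # cv=LeaveOneOut() would hold out one point at a time (much slower)
)
print("Fold accuracies:", np.round(scores, 3))
print("Mean CV accuracy:", scores.mean())
```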

To use cross-validation to find a good value of C, we can use scikit-learn’s GridSearchCV class. This class will try each value of C in C_values_to_try, and use cross-validation to estimate how well the model will generalize.

```{python}
C_values_to_try = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]

model = make_pipeline(
    preprocessor,
    LogisticRegression(penalty='l1', solver='liblinear')  # C is set by the grid search
)
search = GridSearchCV(
    model,
    param_grid=dict(
        logisticregression__C=C_values_to_try
    ),
    cv=5,
    verbose=4
).fit(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)
```
Fitting 5 folds for each of 7 candidates, totalling 35 fits
[CV 1/5] END ......logisticregression__C=0.0001;, score=0.618 total time=   0.0s
[CV 2/5] END ......logisticregression__C=0.0001;, score=0.618 total time=   0.1s
[CV 3/5] END ......logisticregression__C=0.0001;, score=0.618 total time=   0.1s
[CV 4/5] END ......logisticregression__C=0.0001;, score=0.618 total time=   0.1s
[CV 5/5] END ......logisticregression__C=0.0001;, score=0.615 total time=   0.1s
[CV 1/5] END .......logisticregression__C=0.001;, score=0.618 total time=   0.1s
[CV 2/5] END .......logisticregression__C=0.001;, score=0.618 total time=   0.1s
[CV 3/5] END .......logisticregression__C=0.001;, score=0.618 total time=   0.1s
[CV 4/5] END .......logisticregression__C=0.001;, score=0.618 total time=   0.1s
[CV 5/5] END .......logisticregression__C=0.001;, score=0.615 total time=   0.1s
[CV 1/5] END ........logisticregression__C=0.01;, score=0.618 total time=   0.1s
[CV 2/5] END ........logisticregression__C=0.01;, score=0.618 total time=   0.1s
[CV 3/5] END ........logisticregression__C=0.01;, score=0.618 total time=   0.1s
[CV 4/5] END ........logisticregression__C=0.01;, score=0.618 total time=   0.1s
[CV 5/5] END ........logisticregression__C=0.01;, score=0.615 total time=   0.1s
[CV 1/5] END .........logisticregression__C=0.1;, score=0.748 total time=   0.1s
[CV 2/5] END .........logisticregression__C=0.1;, score=0.740 total time=   0.1s
[CV 3/5] END .........logisticregression__C=0.1;, score=0.771 total time=   0.1s
[CV 4/5] END .........logisticregression__C=0.1;, score=0.771 total time=   0.1s
[CV 5/5] END .........logisticregression__C=0.1;, score=0.738 total time=   0.1s
[CV 1/5] END ...........logisticregression__C=1;, score=0.626 total time=   0.1s
[CV 2/5] END ...........logisticregression__C=1;, score=0.687 total time=   0.1s
[CV 3/5] END ...........logisticregression__C=1;, score=0.679 total time=   0.1s
[CV 4/5] END ...........logisticregression__C=1;, score=0.649 total time=   0.1s
[CV 5/5] END ...........logisticregression__C=1;, score=0.623 total time=   0.1s
[CV 1/5] END ..........logisticregression__C=10;, score=0.611 total time=   0.1s
[CV 2/5] END ..........logisticregression__C=10;, score=0.679 total time=   0.1s
[CV 3/5] END ..........logisticregression__C=10;, score=0.679 total time=   0.1s
[CV 4/5] END ..........logisticregression__C=10;, score=0.626 total time=   0.1s
[CV 5/5] END ..........logisticregression__C=10;, score=0.631 total time=   0.1s
[CV 1/5] END .........logisticregression__C=100;, score=0.588 total time=   0.1s
[CV 2/5] END .........logisticregression__C=100;, score=0.695 total time=   0.1s
[CV 3/5] END .........logisticregression__C=100;, score=0.672 total time=   0.1s
[CV 4/5] END .........logisticregression__C=100;, score=0.626 total time=   0.1s
[CV 5/5] END .........logisticregression__C=100;, score=0.615 total time=   0.1s

Let’s see what the cross validation process estimated for each value of C.

```{python}
cv_results = pd.DataFrame(search.cv_results_)
cv_results
```
  param_logisticregression__C  ...  split0_test_score  split1_test_score  split2_test_score  split3_test_score  split4_test_score  mean_test_score  std_test_score  rank_test_score
0                      0.0001  ...           0.618321           0.618321           0.618321           0.618321           0.615385         0.617733        0.001174                5
1                       0.001  ...           0.618321           0.618321           0.618321           0.618321           0.615385         0.617733        0.001174                5
2                        0.01  ...           0.618321           0.618321           0.618321           0.618321           0.615385         0.617733        0.001174                5
3                         0.1  ...           0.748092           0.740458           0.770992           0.770992           0.738462         0.753799        0.014402                1
4                           1  ...           0.625954           0.687023           0.679389           0.648855           0.623077         0.652860        0.026450                2
5                          10  ...           0.610687           0.679389           0.679389           0.625954           0.630769         0.645238        0.028662                3
6                         100  ...           0.587786           0.694656           0.671756           0.625954           0.615385         0.639107        0.038785                4

[7 rows × 14 columns]

Let’s reshape that data frame to make it easier to plot.

```{python}
cv_results_melted = cv_results.melt(
    id_vars='param_logisticregression__C',
    value_vars=['split{}_test_score'.format(i) for i in range(5)],
    var_name='fold',
    value_name='score'
).rename(columns=dict(
    param_logisticregression__C='C'
))
cv_results_melted
```
         C               fold     score
0   0.0001  split0_test_score  0.618321
1    0.001  split0_test_score  0.618321
2     0.01  split0_test_score  0.618321
3      0.1  split0_test_score  0.748092
4        1  split0_test_score  0.625954
5       10  split0_test_score  0.610687
6      100  split0_test_score  0.587786
7   0.0001  split1_test_score  0.618321
8    0.001  split1_test_score  0.618321
9     0.01  split1_test_score  0.618321
10     0.1  split1_test_score  0.740458
11       1  split1_test_score  0.687023
12      10  split1_test_score  0.679389
13     100  split1_test_score  0.694656
14  0.0001  split2_test_score  0.618321
15   0.001  split2_test_score  0.618321
16    0.01  split2_test_score  0.618321
17     0.1  split2_test_score  0.770992
18       1  split2_test_score  0.679389
19      10  split2_test_score  0.679389
20     100  split2_test_score  0.671756
21  0.0001  split3_test_score  0.618321
22   0.001  split3_test_score  0.618321
23    0.01  split3_test_score  0.618321
24     0.1  split3_test_score  0.770992
25       1  split3_test_score  0.648855
26      10  split3_test_score  0.625954
27     100  split3_test_score  0.625954
28  0.0001  split4_test_score  0.615385
29   0.001  split4_test_score  0.615385
30    0.01  split4_test_score  0.615385
31     0.1  split4_test_score  0.738462
32       1  split4_test_score  0.623077
33      10  split4_test_score  0.630769
34     100  split4_test_score  0.615385

```{python}
px.box(
    cv_results_melted,
    x="C", y="score", log_x=True,
    title="Cross-validation scores",
    labels=dict(
        C="Regularization strength (C)",
        score="Accuracy"
    )
)
```

GridSearchCV then chooses the value of C that gives the best cross-validation score, and trains a model using that value of C on the entire training set. The grid-search process estimated that the best value of C is:

```{python}
search.best_params_
```
{'logisticregression__C': 0.1}
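
Because GridSearchCV refits on the full training set by default (refit=True), the search object now wraps a pipeline retrained with that C; search.predict and search.score delegate to it, and it is also available directly:

```{python}
# The pipeline refit with the best C on the entire training set
best_model = search.best_estimator_
print(best_model.named_steps['logisticregression'].C)
```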

We can use that model to make predictions on the test set.

print("Accuracy on training set:", search.score(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
))
print("Accuracy on test set:", search.score(
    X=titanic_test[feature_columns],
    y=titanic_test[target_column]
))
Accuracy on training set: 0.8669724770642202
Accuracy on test set: 0.76793893129771

Your turn: How did that test set accuracy compare with the accuracy estimated by cross validation?

Gradient Boosting

One last good example before we finish: gradient boosting classifiers are often a good choice for tabular problems like this one. Let’s try one on this dataset.

```{python}
from sklearn.ensemble import HistGradientBoostingClassifier
```

Since gradient boosting models can handle missing data and don’t require scaling, let’s use an even simpler pipeline.

```{python}
model = make_pipeline(
    make_column_transformer(
        ('passthrough', numeric_features),
        (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features)
    ),
    HistGradientBoostingClassifier()
).fit(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)
print("Accuracy on training set:", model.score(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
))
print("Accuracy on test set:", model.score(
    X=titanic_test[feature_columns],
    y=titanic_test[target_column]
))
```
Accuracy on training set: 1.0
Accuracy on test set: 0.783206106870229

How did it do that? Read about gradient boosted trees in the sklearn docs!

Note that gradient boosting can also be used for regression tasks (HistGradientBoostingRegressor).
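
If you’d like to see the regression version in action, here’s a quick sketch on a made-up task (predicting fare from age, purely for illustration):

```{python}
from sklearn.ensemble import HistGradientBoostingRegressor

# Hypothetical illustration: predict fare from age. Missing ages are
# handled natively; we only drop the rows where the target is missing.
fare_known = titanic_train.dropna(subset=["fare"])
reg = HistGradientBoostingRegressor().fit(
    X=fare_known[["age"]], y=fare_known["fare"]
)
print("R^2 on training set:", reg.score(fare_known[["age"]], fare_known["fare"]))
```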

Footnotes

  1. This exercise is based in part on this example from the sklearn docs.