Exercise 12: Logistic Regression, Regularization, and Cross-Validation

In this exercise, we’ll practice classification using a logistic regression model. Since this model is prone to overfitting, we’ll learn how to tame its excesses with regularization, and how to estimate how well it will generalize to new data using cross-validation.

Completing this exercise will help you be able to:

  • Fit and evaluate a logistic regression classifier in scikit-learn.
  • Use regularization to reduce overfitting, and describe how the regularization parameter C affects what the model learns.
  • Use cross-validation to choose hyperparameters and to estimate how well a model will generalize to new data.

Along the way, we’ll also see a technique for handling missing data, and get more practice with how to build pipelines in scikit-learn.

Code will be provided for this exercise. You should retype the code on your own and try to understand what each part does. There will also be some conceptual questions for you to answer.

Getting Started

We’ll use a dataset of who survived the Titanic shipwreck. To make the point of this exercise clearer, we’ll also augment the dataset with some irrelevant (noise) features. We’ll keep the original goal, though: to predict whether a passenger survived or not.1

This task is a classic in machine learning, but I’ve never seen anyone mention how fake this prediction task is! There is no real-world scenario where you get to see whether some passengers survived and some didn’t, and then you have to predict whether other passengers survived or not. When you’re doing projects, please think about whether the task you’re trying to solve is (a) a task that anyone would actually care about and (b) a task that’s doable using data that you’d actually have access to.

Here are the imports you’ll need.

```{python}
import pandas as pd
import numpy as np

from sklearn.compose import make_column_transformer
from sklearn.datasets import fetch_openml, load_breast_cancer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, LeaveOneOut, cross_val_score, GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import accuracy_score

import plotly.express as px
```

We’ll load the data from a site called OpenML. OpenML is a repository of datasets that are useful for machine learning. Specifically, we’ll load the Titanic dataset. This dataset contains information about passengers on the Titanic, including whether they survived or not. The Kaggle page for this dataset has more information about the features.

```{python}
titanic_all = fetch_openml(
    data_id=40945, as_frame=True, parser="pandas"
)
titanic = titanic_all.frame
```

Now we’ll add some noise features. Just copy-paste this code; don’t worry about how it works.

```{python}
rng = np.random.RandomState(0)
n_samples = titanic.shape[0]
n_features = 500
noise_feature_names = [f"noise_{i}" for i in range(n_features)]
noise_features = pd.DataFrame(
    rng.normal(size=(n_samples, n_features)),
    columns=noise_feature_names
)
titanic = pd.concat([titanic, noise_features], axis=1)
titanic.info()
```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Columns: 514 entries, pclass to noise_499
dtypes: category(3), float64(503), int64(3), object(5)
memory usage: 5.1+ MB

We’ll use two numeric features (in addition to the noise features):

  • age: the age of the passenger
  • fare: the fare paid by the passenger

And three categorical features:

  • embarked: the port where the passenger embarked (C = Cherbourg, Q = Queenstown, S = Southampton)
  • sex
  • pclass: the passenger class (1 = first class, 2 = second class, 3 = third class)

Note that pclass is encoded as a number, but it’s really a categorical feature. We’ll treat it as categorical.

```{python}
# Select the features and target
target_column = "survived"
numeric_features = ["age", "fare"] + noise_feature_names
categorical_features = ["embarked", "sex", "pclass"]
feature_columns = numeric_features + categorical_features
```

Let’s hold out a test set. We’ll use a large test set so that our estimates of generalization error have less variance.

```{python}
titanic_train, titanic_test = train_test_split(
    titanic, test_size=.5, random_state=0
)
print(f"Training set size: {titanic_train.shape[0]}")
print(f"Test set size: {titanic_test.shape[0]}")
```
Training set size: 654
Test set size: 655

The numeric features will need some preprocessing. We’ll first scale them to be zero mean and unit variance, and then impute missing values. Note: at this step we’re just defining what processing will happen; the actual processing will be done later when we fit the pipeline.

```{python}
numeric_transformer = make_pipeline(
    StandardScaler(),
    SimpleImputer(strategy="median")
)
```

Pause to think:

  • What is strategy="median" doing? What alternative strategies could we use for imputing missing values? Do you think the median would be a good guess for the missing values?
  • Why do we need to impute missing values after scaling the features? What would happen if we did it the other way around?
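
If you want to experiment with the first question, here’s a tiny sketch on made-up data (a toy column, not one of the Titanic features):

```{python}
# Toy illustration: how each SimpleImputer strategy fills in the NaN.
# Note how the outlier (100.0) pulls the mean but not the median.
toy = pd.DataFrame({'x': [1.0, 2.0, 3.0, 100.0, np.nan]})
for strategy in ['mean', 'median', 'most_frequent']:
    filled = SimpleImputer(strategy=strategy).fit_transform(toy)
    print(strategy, '->', filled[-1, 0])
```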

The categorical features need to be one-hot-encoded.

```{python}
categorical_transformer = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore", sparse_output=False)
)
```

Pause to think: How many columns will the categorical transformer produce? What will each column represent? Write a brief explanation.
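
To check your answer empirically, here’s a toy example (the color column is made up, unrelated to the Titanic data):

```{python}
# Toy illustration: OneHotEncoder makes one output column per distinct value
enc = OneHotEncoder(sparse_output=False).set_output(transform='pandas')
enc.fit_transform(pd.DataFrame({'color': ['red', 'blue', 'red', 'green']}))
```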

We can combine the numeric and categorical transformers into a single transformer that will apply each one to the appropriate columns.

```{python}
preprocessor = make_column_transformer(
    (numeric_transformer, numeric_features),
    (categorical_transformer, categorical_features)
).set_output(transform='pandas')
```

Let’s show an example of what this preprocessor will do to our data. Check your results from above.

```{python}
example_preprocessed_data = preprocessor.fit_transform(titanic_train[feature_columns])
example_preprocessed_data.info()
example_preprocessed_data.head()
```
<class 'pandas.core.frame.DataFrame'>
Index: 654 entries, 293 to 684
Columns: 510 entries, pipeline-1__age to pipeline-2__pclass_3
dtypes: float64(510)
memory usage: 2.5 MB

      pipeline-1__age  pipeline-1__fare  pipeline-1__noise_0  ...  pipeline-2__embarked_C  pipeline-2__embarked_Q  pipeline-2__embarked_S  pipeline-2__sex_female  pipeline-2__sex_male  pipeline-2__pclass_1  pipeline-2__pclass_2  pipeline-2__pclass_3
293         -0.140868          0.405374             0.534576  ...                     0.0                     0.0                     1.0                     1.0                   0.0                   1.0                   0.0                   0.0
76           0.649024          1.070188            -0.250568  ...                     1.0                     0.0                     0.0                     1.0                   0.0                   1.0                   0.0                   0.0
48           1.654342         -0.118530            -0.283955  ...                     1.0                     0.0                     0.0                     1.0                   0.0                   1.0                   0.0                   0.0
689         -0.571718         -0.549441            -0.739937  ...                     0.0                     0.0                     1.0                     0.0                   1.0                   0.0                   0.0                   1.0
1195        -0.140868         -0.538772             0.097809  ...                     0.0                     1.0                     0.0                     0.0                   1.0                   0.0                   0.0                   1.0

5 rows × 510 columns

Notice that there are no missing values and that there are separate columns for each of the possible values of the categorical variables.

Now we can put all of those elements together to train a basic logistic regression model.

```{python}
model = make_pipeline(
    preprocessor,
    LogisticRegression(penalty=None)
)
```

Your turn: fit and evaluate this model (using model.fit(...)). What is the accuracy on the training set? What is the accuracy on the test set?

Accuracy on training set: 1.000
Accuracy on test set: 0.652
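
For reference, here’s one way to produce those numbers (a sketch; the same fit-and-score pattern appears again below):

```{python}
model.fit(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)
print("Accuracy on training set: {:.3f}".format(model.score(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)))
print("Accuracy on test set: {:.3f}".format(model.score(
    X=titanic_test[feature_columns],
    y=titanic_test[target_column]
)))
```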

What do you notice about the accuracy of the model on the training set vs the test set?

Regularization

The model was able to get a perfect score on the training set. But to do that, it used some of the noise features, which clearly will not generalize to new data. We can see this by looking at the coefficients of the model:

```{python}
def get_coefs_df(model, feature_names):
    logreg = model.named_steps['logisticregression']
    assert logreg.coef_.shape[0] == 1
    assert logreg.coef_.shape[1] == len(feature_names)
    coefs_df = pd.DataFrame(dict(
        name=feature_names,
        coef=logreg.coef_[0]
    ))
    coefs_df['abs_coef'] = coefs_df['coef'].abs()
    return coefs_df.sort_values('abs_coef', ascending=False)

coefs_df = get_coefs_df(model, example_preprocessed_data.columns)
coefs_df.head(10)
```
                        name       coef   abs_coef
506     pipeline-2__sex_male -12.828261  12.828261
505   pipeline-2__sex_female  11.729937  11.729937
509     pipeline-2__pclass_3  -7.393828   7.393828
1           pipeline-1__fare   5.932705   5.932705
381    pipeline-1__noise_379  -5.126387   5.126387
507     pipeline-2__pclass_1   4.860676   4.860676
473    pipeline-1__noise_471  -4.829957   4.829957
104    pipeline-1__noise_102   4.396529   4.396529
193    pipeline-1__noise_191  -4.325752   4.325752
484    pipeline-1__noise_482  -4.149559   4.149559

```{python}
coefs_df['is_noise'] = coefs_df['name'].str.contains('noise')
print("Total noise coefficients: {:.2f}".format(
    coefs_df.query('is_noise')['coef'].abs().sum()
))
```
Total noise coefficients: 617.16

If we could get the model to put less weight on the noise features, it might generalize better. But in a real example, we don’t know which features are noise. And some features may be partly noise, partly signal. So we can’t just remove the noise features.

Instead, we can use regularization to encourage the model to put less weight on any individual feature. The hope is that this will encourage the model to put more weight on the features that are signal, and less weight on the features that are noise.
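
Concretely, with penalty='l1', scikit-learn fits the coefficients $w$ (and intercept $c$) by minimizing an $\ell_1$-penalized log-loss; per the scikit-learn docs (with labels coded as $y_i \in \{-1, +1\}$):

$$\min_{w,\,c}\;\; \|w\|_1 \;+\; C \sum_{i=1}^{n} \log\!\left(1 + \exp\left(-y_i (x_i^\top w + c)\right)\right)$$

Note that C multiplies the data-fit term rather than the penalty, which is worth keeping in mind for the questions below.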

```{python}
model = make_pipeline(
    preprocessor,
    LogisticRegression(C=1e-1, penalty='l1', solver='liblinear')
).fit(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)
print("Accuracy on training set:", model.score(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
))
print("Accuracy on test set:", model.score(
    X=titanic_test[feature_columns],
    y=titanic_test[target_column]
))
```
Accuracy on training set: 0.8669724770642202
Accuracy on test set: 0.76793893129771
```{python}
coefs_df = get_coefs_df(model, example_preprocessed_data.columns)
coefs_df.head(10)
```
                        name      coef  abs_coef
506     pipeline-2__sex_male -1.213080  1.213080
505   pipeline-2__sex_female  0.934112  0.934112
509     pipeline-2__pclass_3 -0.544729  0.544729
507     pipeline-2__pclass_1  0.422453  0.422453
473    pipeline-1__noise_471 -0.216718  0.216718
193    pipeline-1__noise_191 -0.190734  0.190734
381    pipeline-1__noise_379 -0.167910  0.167910
389    pipeline-1__noise_387 -0.161352  0.161352
484    pipeline-1__noise_482 -0.152020  0.152020
296    pipeline-1__noise_294 -0.138615  0.138615

```{python}
coefs_df['is_noise'] = coefs_df['name'].str.contains('noise')
print("Total weight on noise coefficients: {:.2f}".format(
    coefs_df.query('is_noise')['coef'].abs().sum()
))
```
Total weight on noise coefficients: 5.27

Your turn: what do you notice about the model’s accuracy scores now? What do you notice about the coefficients?

Your turn: Try adjusting the value of C in the model. Answer:

  • How does the training accuracy change as C changes?
  • What value of C gives the best training accuracy?
  • How does the test accuracy change as C changes?
  • What value of C gives the best test accuracy?
  • How does the amount of weight that the model puts on the noise features change as C changes?
  • Does a high value of C seem to correspond to regularizing a lot or a little?

This plot might help you answer some of those questions:

```{python}
# Make a plot of train and test accuracy vs C
C_values_to_try = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]
train_accuracies = []
test_accuracies = []
for C in C_values_to_try:
    model = make_pipeline(
        preprocessor,
        LogisticRegression(C=C, penalty='l1', solver='liblinear')
    ).fit(
        X=titanic_train[feature_columns],
        y=titanic_train[target_column]
    )
    train_accuracies.append(model.score(
        X=titanic_train[feature_columns],
        y=titanic_train[target_column]
    ))
    test_accuracies.append(model.score(
        X=titanic_test[feature_columns],
        y=titanic_test[target_column]
    ))
accuracies_df = pd.DataFrame(dict(
    C=C_values_to_try,
    train_accuracy=train_accuracies,
    test_accuracy=test_accuracies
)).melt(id_vars='C', var_name='dataset', value_name='accuracy')
px.line(accuracies_df, x='C', y='accuracy', color='dataset', log_x=True)
```

Cross-Validation

We’ve seen that regularization can help prevent overfitting. But how do we know how much regularization to use? We could try different values of the regularization parameter C and see which one gives the best accuracy on the test set. But then we’d be using the test set to choose the model, and a test set that helped choose the model is no longer a reliable estimate of generalization error.

Instead, we can use cross-validation to estimate how well the model will generalize to new data. The idea is to hold out one of the data points, train the model on the rest, and then see how well it predicts the held-out data point. We can repeat this process for each data point, and then average the scores to get an estimate of how well the model will generalize.

Buuut… that requires training the model as many times as there are data points. That’s a lot of training! (I tried it and got tired of waiting.) So, instead of holding out one data point at a time, we can hold out a bunch of points at a time. These groups of points are called folds. We then test the model on each fold, using the other folds as the training set. We can then average the scores on each fold to get an estimate of how well the model will generalize.

Your turn: If we have 100 data points and we use 10-fold cross-validation, how many data points will be in each fold? How many times will we train a model? How big will the training set be each time? How big will the test set be each time?
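
Here’s a sketch of what cross-validation looks like in code for a single fixed value of C, using the cross_val_score helper imported earlier:

```{python}
# 5-fold cross-validation for one value of C. cross_val_score refits a
# fresh clone of the pipeline on each fold's training portion.
cv_model = make_pipeline(
    preprocessor,
    LogisticRegression(C=1e-1, penalty='l1', solver='liblinear')
)
scores = cross_val_score(
    cv_model,
    X=titanic_train[feature_columns],
    y=titanic_train[target_column],
    cv=5  # cv=LeaveOneOut() would hold out one point at a time (much slower)
)
print("Fold accuracies:", np.round(scores, 3))
print("Mean CV accuracy:", scores.mean())
```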

To use cross-validation to find a good value of C, we can use scikit-learn’s GridSearchCV class. This class will try each value of C in C_values_to_try, and use cross-validation to estimate how well the model will generalize.

```{python}
C_values_to_try = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10, 100]

model = make_pipeline(
    preprocessor,
    LogisticRegression(penalty='l1', solver='liblinear')  # C is set by the grid search
)
search = GridSearchCV(
    model,
    param_grid=dict(
        logisticregression__C=C_values_to_try
    ),
    cv=5,
    verbose=4
).fit(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)
```
Fitting 5 folds for each of 7 candidates, totalling 35 fits
[CV 1/5] END ......logisticregression__C=0.0001;, score=0.618 total time=   0.0s
[CV 2/5] END ......logisticregression__C=0.0001;, score=0.618 total time=   0.1s
[CV 3/5] END ......logisticregression__C=0.0001;, score=0.618 total time=   0.1s
[CV 4/5] END ......logisticregression__C=0.0001;, score=0.618 total time=   0.1s
[CV 5/5] END ......logisticregression__C=0.0001;, score=0.615 total time=   0.1s
[CV 1/5] END .......logisticregression__C=0.001;, score=0.618 total time=   0.1s
[CV 2/5] END .......logisticregression__C=0.001;, score=0.618 total time=   0.1s
[CV 3/5] END .......logisticregression__C=0.001;, score=0.618 total time=   0.1s
[CV 4/5] END .......logisticregression__C=0.001;, score=0.618 total time=   0.1s
[CV 5/5] END .......logisticregression__C=0.001;, score=0.615 total time=   0.1s
[CV 1/5] END ........logisticregression__C=0.01;, score=0.618 total time=   0.1s
[CV 2/5] END ........logisticregression__C=0.01;, score=0.618 total time=   0.1s
[CV 3/5] END ........logisticregression__C=0.01;, score=0.618 total time=   0.1s
[CV 4/5] END ........logisticregression__C=0.01;, score=0.618 total time=   0.1s
[CV 5/5] END ........logisticregression__C=0.01;, score=0.615 total time=   0.1s
[CV 1/5] END .........logisticregression__C=0.1;, score=0.748 total time=   0.1s
[CV 2/5] END .........logisticregression__C=0.1;, score=0.740 total time=   0.1s
[CV 3/5] END .........logisticregression__C=0.1;, score=0.771 total time=   0.1s
[CV 4/5] END .........logisticregression__C=0.1;, score=0.771 total time=   0.1s
[CV 5/5] END .........logisticregression__C=0.1;, score=0.738 total time=   0.1s
[CV 1/5] END ...........logisticregression__C=1;, score=0.626 total time=   0.1s
[CV 2/5] END ...........logisticregression__C=1;, score=0.687 total time=   0.1s
[CV 3/5] END ...........logisticregression__C=1;, score=0.679 total time=   0.1s
[CV 4/5] END ...........logisticregression__C=1;, score=0.649 total time=   0.1s
[CV 5/5] END ...........logisticregression__C=1;, score=0.623 total time=   0.1s
[CV 1/5] END ..........logisticregression__C=10;, score=0.611 total time=   0.1s
[CV 2/5] END ..........logisticregression__C=10;, score=0.679 total time=   0.1s
[CV 3/5] END ..........logisticregression__C=10;, score=0.679 total time=   0.1s
[CV 4/5] END ..........logisticregression__C=10;, score=0.626 total time=   0.1s
[CV 5/5] END ..........logisticregression__C=10;, score=0.631 total time=   0.1s
[CV 1/5] END .........logisticregression__C=100;, score=0.588 total time=   0.1s
[CV 2/5] END .........logisticregression__C=100;, score=0.695 total time=   0.1s
[CV 3/5] END .........logisticregression__C=100;, score=0.672 total time=   0.1s
[CV 4/5] END .........logisticregression__C=100;, score=0.626 total time=   0.1s
[CV 5/5] END .........logisticregression__C=100;, score=0.615 total time=   0.1s

Let’s see what the cross validation process estimated for each value of C.

```{python}
cv_results = pd.DataFrame(search.cv_results_)
cv_results
```
  param_logisticregression__C  ...  split0_test_score  split1_test_score  split2_test_score  split3_test_score  split4_test_score  mean_test_score  std_test_score  rank_test_score
0                      0.0001  ...           0.618321           0.618321           0.618321           0.618321           0.615385         0.617733        0.001174                5
1                       0.001  ...           0.618321           0.618321           0.618321           0.618321           0.615385         0.617733        0.001174                5
2                        0.01  ...           0.618321           0.618321           0.618321           0.618321           0.615385         0.617733        0.001174                5
3                         0.1  ...           0.748092           0.740458           0.770992           0.770992           0.738462         0.753799        0.014402                1
4                           1  ...           0.625954           0.687023           0.679389           0.648855           0.623077         0.652860        0.026450                2
5                          10  ...           0.610687           0.679389           0.679389           0.625954           0.630769         0.645238        0.028662                3
6                         100  ...           0.587786           0.694656           0.671756           0.625954           0.615385         0.639107        0.038785                4

[7 rows × 14 columns]

Let’s reshape that data frame to make it easier to plot.

```{python}
cv_results_melted = cv_results.melt(
    id_vars='param_logisticregression__C',
    value_vars=['split{}_test_score'.format(i) for i in range(5)],
    var_name='fold',
    value_name='score'
).rename(columns=dict(
    param_logisticregression__C='C'
))
cv_results_melted
```
         C               fold     score
0   0.0001  split0_test_score  0.618321
1    0.001  split0_test_score  0.618321
2     0.01  split0_test_score  0.618321
3      0.1  split0_test_score  0.748092
4        1  split0_test_score  0.625954
5       10  split0_test_score  0.610687
6      100  split0_test_score  0.587786
7   0.0001  split1_test_score  0.618321
8    0.001  split1_test_score  0.618321
9     0.01  split1_test_score  0.618321
10     0.1  split1_test_score  0.740458
11       1  split1_test_score  0.687023
12      10  split1_test_score  0.679389
13     100  split1_test_score  0.694656
14  0.0001  split2_test_score  0.618321
15   0.001  split2_test_score  0.618321
16    0.01  split2_test_score  0.618321
17     0.1  split2_test_score  0.770992
18       1  split2_test_score  0.679389
19      10  split2_test_score  0.679389
20     100  split2_test_score  0.671756
21  0.0001  split3_test_score  0.618321
22   0.001  split3_test_score  0.618321
23    0.01  split3_test_score  0.618321
24     0.1  split3_test_score  0.770992
25       1  split3_test_score  0.648855
26      10  split3_test_score  0.625954
27     100  split3_test_score  0.625954
28  0.0001  split4_test_score  0.615385
29   0.001  split4_test_score  0.615385
30    0.01  split4_test_score  0.615385
31     0.1  split4_test_score  0.738462
32       1  split4_test_score  0.623077
33      10  split4_test_score  0.630769
34     100  split4_test_score  0.615385

```{python}
px.box(
    cv_results_melted,
    x="C", y="score", log_x=True,
    title="Cross-validation scores",
    labels=dict(
        C="Regularization strength (C)",
        score="Accuracy"
    )
)
```

GridSearchCV then chooses the value of C that gives the best cross-validation score, and trains a model using that value of C on the entire training set. The grid-search process estimated that the best value of C is:

```{python}
search.best_params_
```
{'logisticregression__C': 0.1}
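
Because GridSearchCV refits on the full training set by default (refit=True), the search object now wraps a pipeline retrained with that C; search.predict and search.score delegate to it, and it is also available directly:

```{python}
# The pipeline refit with the best C on the entire training set
best_model = search.best_estimator_
print(best_model.named_steps['logisticregression'].C)
```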

We can use that model to make predictions on the test set.

print("Accuracy on training set:", search.score(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
))
print("Accuracy on test set:", search.score(
    X=titanic_test[feature_columns],
    y=titanic_test[target_column]
))
Accuracy on training set: 0.8669724770642202
Accuracy on test set: 0.76793893129771

Your turn: How did that test set accuracy compare with the accuracy estimated by cross validation?

Gradient Boosting

One last good example before we finish: gradient boosting classifiers are often a good choice for tabular problems like this one. Let’s try one on this dataset.

```{python}
from sklearn.ensemble import HistGradientBoostingClassifier
```

Since gradient boosting models can handle missing data and don’t require scaling, let’s use an even simpler pipeline.

```{python}
model = make_pipeline(
    make_column_transformer(
        ('passthrough', numeric_features),
        (OneHotEncoder(handle_unknown="ignore", sparse_output=False), categorical_features)
    ),
    HistGradientBoostingClassifier()
).fit(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
)
print("Accuracy on training set:", model.score(
    X=titanic_train[feature_columns],
    y=titanic_train[target_column]
))
print("Accuracy on test set:", model.score(
    X=titanic_test[feature_columns],
    y=titanic_test[target_column]
))
```
Accuracy on training set: 1.0
Accuracy on test set: 0.783206106870229

How did it do that? Read about gradient boosted trees in the sklearn docs!

Note that gradient boosting can also be used for regression tasks (HistGradientBoostingRegressor).
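
If you’d like to see the regression version in action, here’s a quick sketch on a made-up task (predicting fare from age, purely for illustration):

```{python}
from sklearn.ensemble import HistGradientBoostingRegressor

# Hypothetical illustration: predict fare from age. Missing ages are
# handled natively; we only drop the rows where the target is missing.
fare_known = titanic_train.dropna(subset=["fare"])
reg = HistGradientBoostingRegressor().fit(
    X=fare_known[["age"]], y=fare_known["fare"]
)
print("R^2 on training set:", reg.score(fare_known[["age"]], fare_known["fare"]))
```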

Footnotes

  1. This exercise is based in part on this example from the sklearn docs.