Exercise 9: Practice with Supervised Learning

In this exercise, we’ll practice setting up a supervised learning problem. Completing this exercise will give you practice wrangling data, splitting it into training and test sets, fitting regression models, and evaluating and comparing them.

Getting Started

Make a Quarto notebook for this exercise. Here are the imports you’ll need.

```{python}
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
import pandas as pd

import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"
```

Wrangle Data

We’ll be using a dataset of home sales in Ames, Iowa. Each row is a home that was sold. The dataset includes lots of information about each home, such as its size, the year it was built, and its location. It also includes the sale price of each home.

Here is the data dictionary that the author provided, and the academic paper that describes the dataset.

Download the dataset from this link and put it in the data folder. Then read it into a Pandas DataFrame. Look at the output of .info() and .head() as usual, although you may not want to include them in your final report because there are so many columns.

Note

If you try to view this data in the RStudio data viewer, you won’t see all of the columns at first because it only shows 50 at a time. At the top of the data viewer, you can click the right and left arrows to show a different set of columns.

Part 1: Scale the data

First, write a brief description of the shape of the dataframe.

Then, perform the following data wrangling steps:

  • Scale the Sale_Price column by dividing by 1000. This will make the numbers easier to read. (The units will be thousands of dollars.)
  • Rename the Sale_Price column to sale_price.
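These two steps might look roughly like the following sketch. (It uses a tiny stand-in DataFrame so the snippet runs on its own; the real Ames frame has many more columns.)

```{python}
import pandas as pd

# Tiny stand-in for the Ames data (the real frame has many more columns)
ames = pd.DataFrame({"Sale_Price": [215000, 105000, 172000]})

# Scale to thousands of dollars, then rename to snake_case
ames["Sale_Price"] = ames["Sale_Price"] / 1000
ames = ames.rename(columns={"Sale_Price": "sale_price"})
```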

Part 2: Filter the data

The author of the original paper suggested that we work with a subset of the data where the homes are:

  • Less than 4000 square feet of above-grade living area and
  • Sold as a “Normal Sale” (instead of foreclosure, etc.)

Filter the data to only include those homes. Use the data dictionary to look up what columns correspond to these two variables, and what values of those columns correspond to the conditions we want.

Check that you obtain 2412 homes.
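If the data dictionary points you to the Gr_Liv_Area and Sale_Condition columns (the first name is used later in this exercise; the "Normal" value here is an assumption you should verify against the dictionary), the filter might look like this sketch on a toy frame:

```{python}
import pandas as pd

# Toy rows; only the last one satisfies both conditions
ames = pd.DataFrame({
    "Gr_Liv_Area": [4676, 1500, 2000],
    "Sale_Condition": ["Partial", "Abnorml", "Normal"],
})
ames = ames[(ames["Gr_Liv_Area"] < 4000) & (ames["Sale_Condition"] == "Normal")]
```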

Set up the supervised learning task

We’ll use this data to try to predict sale price in thousands of dollars from two features: above ground living area and year built.

Part 3: Describe the supervised learning task

Task description

Write a brief description of the task we’ll be performing in English: What’s the target variable? What are the features? What metrics will we use to evaluate the model? How will we make sure that we can evaluate the model on data that it wasn’t trained on?

We haven’t yet specified the metrics to use; come back to this part after you read the rest of this section.

Now in code: Use the data dictionary to find the column names for these columns. Then make a list of the feature columns and a string for the target column.

```{python}
feature_columns = [..., ...]
target_column = ...
```

Train-Test Split

Remember that our goal is to be able to predict what homes will sell for before they’re sold.

But our dataset has only homes that were already sold. How can we possibly figure out how well we’d predict a sale price before it’s sold?

Our strategy, which we’ll discuss more in future weeks, will be to hold out a “testing set” of homes. We won’t let our model see the actual sale price for these homes.

The homes where we do show the model the sale price we’ll call the “training” homes.

We’ll make this split randomly but consistently: we’ll first seed the random number generator so it always gives the same sequence of numbers.

Use the train_test_split function to split the data into training and test sets. Use a random_state of 42 and a test_size of 0.2. Check the number of homes in the training and test sets to make sure they are correct.

Training set size: 1929 homes
Test set size: 483 homes
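On a toy frame, the split call looks like this; the same call on the real filtered data should give the 1929/483 split above.

```{python}
import pandas as pd
from sklearn.model_selection import train_test_split

ames = pd.DataFrame({"sale_price": range(100)})  # stand-in for the filtered data

# random_state seeds the generator so the split is reproducible
ames_train, ames_test = train_test_split(ames, test_size=0.2, random_state=42)
```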

We’ll use mean_absolute_error and mean_absolute_percentage_error to evaluate the model. Here’s a function that will compute both metrics at the same time.

```{python}
def evaluate(y_true, y_pred):
    return pd.Series({
        'MAE': mean_absolute_error(y_true, y_pred),
        'MAPE': mean_absolute_percentage_error(y_true, y_pred),
        # 'MSE': mean_squared_error(y_true, y_pred),
    })
```

EDA

To keep this short, let’s make only the most essential EDA plots.

Part 4: Distribution of the target variable

Make a histogram of the target variable. Use 50 bins. Are there any extremely large or small values compared with the rest of the data?

Part 5: Target vs features

Make the following two plots, then write a sentence about what they tell you about the relationships of each feature to the target variable.

Make a scatter plot of sale_price vs. Gr_Liv_Area. Add a lowess trendline. I used a marker_size of 3 to make the points smaller.

Make a scatter plot of sale_price by Year_Built.

Fit models

Part 6: Linear Regression

Here is code to train and evaluate a linear regression model on the training set.

```{python}
lr = LinearRegression().fit(
    X=ames_train[feature_columns],
    y=ames_train[target_column]
)
ames_train['linreg_prediction'] = lr.predict(ames_train[feature_columns])
evaluate(ames_train[target_column], ames_train['linreg_prediction'])
```
MAE     27.541263
MAPE     0.161787
dtype: float64

Your turn: evaluate the model on the test set. (Remember what we must never do with test sets?)

MAE     28.371445
MAPE     0.168235
dtype: float64

Write a description of the model’s performance on the test set in plain English. Include both the MAE and MAPE metrics. (Look at the manual page for mean_absolute_percentage_error to see how to interpret the MAPE metric.)

Part 7: Decision Tree

Do the same, but use a DecisionTreeRegressor(max_depth=3) instead of a linear regression model. (You can copy and paste the code from above, but be careful to change all the variable names.)

Training set:

MAE     29.851262
MAPE     0.182240
dtype: float64

Test set:

MAE     31.239749
MAPE     0.191820
dtype: float64
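The code parallels Part 6. Here is a sketch on a toy training frame; the dt and dt_prediction names match those used in the wrangling code in the next section.

```{python}
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

feature_columns = ["Gr_Liv_Area", "Year_Built"]
target_column = "sale_price"

# Toy stand-in for ames_train
ames_train = pd.DataFrame({
    "Gr_Liv_Area": [1008, 960, 1652, 1525, 1097],
    "Year_Built": [1956, 1955, 1970, 1997, 1996],
    "sale_price": [131.0, 135.5, 146.5, 183.5, 147.9],
})

dt = DecisionTreeRegressor(max_depth=3).fit(
    X=ames_train[feature_columns],
    y=ames_train[target_column],
)
ames_train["dt_prediction"] = dt.predict(ames_train[feature_columns])
```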

Compare the two models

The following code makes a tidy dataset of the model predictions and residuals. Study how it works; you might find this data wrangling useful.

```{python}
ames_test_by_model = ames_test.melt(
    id_vars=['sale_price'] + feature_columns,
    value_vars=['linreg_prediction', 'dt_prediction'],
    var_name='model',
    value_name='prediction')

ames_test_by_model['model'] = ames_test_by_model['model'].replace({
    'linreg_prediction': 'Linear Regression',
    'dt_prediction': 'Decision Tree'
})

ames_test_by_model['resid'] = ames_test_by_model['sale_price'] - ames_test_by_model['prediction']
ames_test_by_model.head()
```
sale_price Gr_Liv_Area Year_Built model prediction resid
0 131.0 1008 1956 Linear Regression 117.862663 13.137337
1 135.5 960 1955 Linear Regression 112.321404 23.178596
2 146.5 1652 1970 Linear Regression 192.755718 -46.255718
3 183.5 1525 1997 Linear Regression 205.935490 -22.435490
4 147.9 1097 1996 Linear Regression 163.960631 -16.060631

Part 8: Compare the models

Make an actual-vs-predicted scatter plot for each model. Use facet_col to put the two plots side-by-side.

Make a box plot of the residuals for each model.

Write a brief description of which model worked better for this task.

Part 9: Tweak hyperparameters

Try changing the max_depth of the decision tree. What happens to the model’s performance on the training set? What happens to the model’s performance on the test set?

Try adding a feature for number of bedrooms. What happens to the model’s performance on the training set? What happens to the model’s performance on the test set?