```{python}
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.templates.default = "plotly_white"
```
Exercise 9: Practice with Supervised Learning
In this exercise, we’ll practice setting up a supervised learning problem. Completing this exercise will help you be able to:
- Set up a supervised learning task: identifying the target variable and features
- Split data into training and test sets
- Train a model and evaluate its performance
Getting Started
Make a Quarto notebook for this exercise. Here are the imports you’ll need.
Wrangle Data
We’ll be using a dataset of home sales in Ames, Iowa. Each row is a home that was sold. The dataset includes lots of information about each home, such as its size, the year it was built, and its location. It also includes the sale price of each home.
Here is the data dictionary that the author provided, and the academic paper that describes the dataset.
Download the dataset from this link and put it in the `data` folder. Then read it into a Pandas DataFrame. Look at the `.info()` and `.head()` as usual, although you may not want to include them in your final report because there are so many columns.
If you try to view this data in the RStudio data viewer, you won’t see all of the columns at first because it only shows 50 at a time. At the top of the data viewer, you can click the right and left arrows to show a different set of columns.
Part 1: Scale the data
First, write a brief description of the shape of the dataframe.
Then, perform the following data wrangling steps:
- Scale the `Sale_Price` column by dividing by 1000. This will make the numbers easier to read. (The units will be thousands of dollars.)
- Rename the `Sale_Price` column to `sale_price`.
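As a rough sketch of these two steps on a tiny made-up table (the prices below are invented for illustration and are not taken from the Ames data):

```python
import pandas as pd

# Toy stand-in for the real dataset; these prices are made up.
toy = pd.DataFrame({"Sale_Price": [215000, 105000, 172000]})

# Divide by 1000 so the units become thousands of dollars.
toy["Sale_Price"] = toy["Sale_Price"] / 1000

# rename returns a new DataFrame, so assign the result back.
toy = toy.rename(columns={"Sale_Price": "sale_price"})
print(toy)
```

Doing the scaling before the rename keeps each step easy to check on its own.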
Part 2: Filter the data
The dataset’s author suggested that we work with a subset of the data where the homes:
- Have less than 4000 square feet of above-grade living area, and
- Were sold as a “Normal Sale” (instead of foreclosure, etc.)
Filter the data to only include those homes. Use the data dictionary to look up what columns correspond to these two variables, and what values of those columns correspond to the conditions we want.
Check that you obtain 2412 homes.
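Filtering on two conditions at once can be sketched with a boolean mask. The column names `area` and `sale_type` and all the values below are placeholders made up for this example; you still need the data dictionary to find the real columns and values.

```python
import pandas as pd

# Made-up rows; `area` and `sale_type` are placeholder column names,
# not the real ones from the data dictionary.
toy = pd.DataFrame({
    "area": [1500, 4200, 2100, 3900],
    "sale_type": ["Normal Sale", "Normal Sale", "Foreclosure", "Normal Sale"],
})

# Combine the two conditions with & (each side needs its own parentheses).
keep = (toy["area"] < 4000) & (toy["sale_type"] == "Normal Sale")
toy_subset = toy[keep]
print(len(toy_subset))
```

Printing the length after filtering is the same kind of check as confirming the 2412 homes above.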
Set up the supervised learning task
We’ll use this data to try to predict sale price in thousands of dollars from two features: above ground living area and year built.
Part 3: Describe the supervised learning task
Task description
Write a brief description of the task we’ll be performing in English: What’s the target variable? What are the features? What metrics will we use to evaluate the model? How will we make sure that we can evaluate the model on data that it wasn’t trained on?
We haven’t yet specified the metrics to use; come back to this part after you read the rest of this section.
Now in code: Use the data dictionary to find the column names for these columns. Then make a list of the feature columns and a string for the target column.
feature_columns = [..., ...]
target_column = ...
Train-Test Split
Remember that our goal is to be able to predict what homes will sell for before they’re sold.
But our dataset has only homes that were already sold. How can we possibly figure out how well we’d predict a sale price before it’s sold?
Our strategy, which we’ll discuss more in future weeks, will be to hold out a “testing set” of homes. We won’t let our model see the actual sale price for these homes.
The homes where we do show the model the sale price we’ll call the “training” homes.
We’ll make this split randomly but consistently: we’ll first seed the random number generator so it always gives the same sequence of numbers.
Use the `train_test_split` function to split the data into training and test sets. Use a `random_state` of 42 and a `test_size` of 0.2. Check the number of homes in the training and test sets to make sure they are correct.
Training set size: 1929 homes
Test set size: 483 homes
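To see where those two sizes come from, here is a small sketch that splits a placeholder frame with the same number of rows (2412) rather than the real data. The names `placeholder`, `train_part`, and `test_part` are made up for this example. The test share is 0.2 × 2412 = 482.4, which gets rounded up to 483, leaving 1929 rows for training.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder frame with the same number of rows as the filtered data.
placeholder = pd.DataFrame({"row_id": range(2412)})

train_part, test_part = train_test_split(
    placeholder, test_size=0.2, random_state=42
)
print(len(train_part), len(test_part))

# Splitting again with the same seed reproduces the same training rows.
train_again, _ = train_test_split(placeholder, test_size=0.2, random_state=42)
print(train_part.index.equals(train_again.index))
```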
We’ll use `mean_absolute_error` and `mean_absolute_percentage_error` to evaluate the model. Here’s a function that will compute both metrics at the same time.
```{python}
def evaluate(y_true, y_pred):
    # Return both metrics as a labeled Series so they print together.
    return pd.Series({
        'MAE': mean_absolute_error(y_true, y_pred),
        'MAPE': mean_absolute_percentage_error(y_true, y_pred),
        # 'MSE': mean_squared_error(y_true, y_pred),
    })
```
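To get a feel for what the two metrics measure, here is a hand computation on two made-up prices (in thousands of dollars), compared against the scikit-learn functions. Note that `mean_absolute_percentage_error` returns a fraction, so 0.1 means 10%.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Made-up prices in thousands of dollars.
y_true = np.array([100.0, 200.0])
y_pred = np.array([110.0, 180.0])

# MAE: average absolute miss, in the same units as the target.
mae_by_hand = np.mean(np.abs(y_true - y_pred))  # (10 + 20) / 2

# MAPE: average absolute miss as a fraction of the true value.
mape_by_hand = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))  # (0.10 + 0.10) / 2

print(mae_by_hand, mean_absolute_error(y_true, y_pred))
print(mape_by_hand, mean_absolute_percentage_error(y_true, y_pred))
```

Both misses here are 10% of the true price, but the second one costs twice as much in dollars, which is why the two metrics can rank models differently.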
EDA
To keep this short, let’s make only the most essential EDA plots.
Part 4: Distribution of the target variable
Make a histogram of the target variable. Use 50 bins. Are there any extremely large or small values compared with the rest of the data?
Part 5: Target vs features
Make the following two plots, then write a sentence about what they tell you about the relationships of each feature to the target variable.
Make a scatter plot of `sale_price` vs. `Gr_Liv_Area`. Add a `lowess` trendline. I used a `marker_size` of 3 to make the points smaller.
Make a scatter plot of `sale_price` by `Year_Built`.
Fit models
Part 6: Linear Regression
Here is code to train and evaluate a linear regression model on the training set.
```{python}
lr = LinearRegression().fit(
    X=ames_train[feature_columns],
    y=ames_train[target_column]
)
ames_train['linreg_prediction'] = lr.predict(ames_train[feature_columns])
evaluate(ames_train[target_column], ames_train['linreg_prediction'])
```
MAE 27.541263
MAPE 0.161787
dtype: float64
Your turn: evaluate the model on the test set. (Remember what we never do with test sets?)
MAE 28.371445
MAPE 0.168235
dtype: float64
Write a description of the model’s performance on the test set in plain English. Include both the MAE and MAPE metrics. (Look at the manual page for mean_absolute_percentage_error to see how to interpret the MAPE metric.)
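The general pattern for held-out evaluation can be sketched on synthetic data (not the Ames data): the model is fit on the training rows only, and the test rows are passed to `predict` but never to `fit`. All names and numbers below are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data: y depends linearly on x plus noise.
rng = np.random.default_rng(0)
synthetic = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
synthetic["y"] = 3.0 * synthetic["x"] + rng.normal(0, 1, size=200)

synth_train, synth_test = train_test_split(
    synthetic, test_size=0.2, random_state=42
)

# Fit on the training rows only.
model = LinearRegression().fit(synth_train[["x"]], synth_train["y"])

# The test rows are used only for predict, never for fit.
test_predictions = model.predict(synth_test[["x"]])
test_mae = mean_absolute_error(synth_test["y"], test_predictions)
print(test_mae)
```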
Part 7: Decision Tree
Do the same, but use a `DecisionTreeRegressor(max_depth=3)` instead of a linear regression model. (You can copy and paste the code from above, but be careful to change all the variable names.)
MAE 29.851262
MAPE 0.182240
dtype: float64
MAE 31.239749
MAPE 0.191820
dtype: float64
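One way to build intuition for `max_depth=3`: a binary tree of depth 3 has at most 2³ = 8 leaves, so the model can output at most 8 distinct predictions. The sketch below checks this on synthetic data (made up for this example, not the Ames data).

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 1, size=300)

tree = DecisionTreeRegressor(max_depth=3).fit(X, y)

# Each leaf predicts a single constant value.
n_leaves = tree.get_n_leaves()
n_distinct = len(np.unique(tree.predict(X)))
print(n_leaves, n_distinct)
```

This helps explain the horizontal bands you may see in an actual-vs-predicted plot for a shallow tree.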
Compare the two models
The following code makes a tidy dataset of the model predictions and residuals. Study how it works; you might find this data wrangling useful.
```{python}
ames_test_by_model = ames_test.melt(
    id_vars=['sale_price'] + feature_columns,
    value_vars=['linreg_prediction', 'dt_prediction'],
    var_name='model',
    value_name='prediction')
ames_test_by_model['model'] = ames_test_by_model['model'].replace({
    'linreg_prediction': 'Linear Regression',
    'dt_prediction': 'Decision Tree'
})
ames_test_by_model['resid'] = ames_test_by_model['sale_price'] - ames_test_by_model['prediction']
ames_test_by_model.head()
```
| | sale_price | Gr_Liv_Area | Year_Built | model | prediction | resid |
|---|---|---|---|---|---|---|
| 0 | 131.0 | 1008 | 1956 | Linear Regression | 117.862663 | 13.137337 |
| 1 | 135.5 | 960 | 1955 | Linear Regression | 112.321404 | 23.178596 |
| 2 | 146.5 | 1652 | 1970 | Linear Regression | 192.755718 | -46.255718 |
| 3 | 183.5 | 1525 | 1997 | Linear Regression | 205.935490 | -22.435490 |
| 4 | 147.9 | 1097 | 1996 | Linear Regression | 163.960631 | -16.060631 |
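If `melt` is new to you, a two-row toy example may help. The column names `actual`, `pred_a`, and `pred_b` and all the values are made up. Each prediction column becomes its own block of rows, so 2 rows × 2 prediction columns gives 4 rows.

```python
import pandas as pd

# Wide format: one row per home, one column per model's prediction.
wide = pd.DataFrame({
    "actual": [10.0, 20.0],
    "pred_a": [11.0, 18.0],
    "pred_b": [9.0, 25.0],
})

# Long format: one row per (home, model) pair.
long = wide.melt(
    id_vars=["actual"],
    value_vars=["pred_a", "pred_b"],
    var_name="model",
    value_name="prediction",
)

# Residual = actual minus predicted; positive means the model guessed too low.
long["resid"] = long["actual"] - long["prediction"]
print(long)
```

The long format is what lets a plotting function split by model with a single column name.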
Part 8: Compare the models
Make an actual-vs-predicted scatter plot for each model. Use `facet_col` to put the two plots side-by-side.
Make a box plot of the residuals for each model.
Write a brief description of which model worked better for this task.
Part 9: Tweak hyperparameters
Try changing the `max_depth` of the decision tree. What happens to the model’s performance on the training set? What happens to the model’s performance on the test set?
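One way to organize this experiment is a loop over depths that records both training and test error. The sketch below uses synthetic noisy data (made up for this example), so the numbers will not match the Ames results. It does show the mechanics, and that an unrestricted tree (`max_depth=None`) can fit its training rows essentially perfectly.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data with two features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(400, 2))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(0, 2.0, size=400)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Map each depth to (training MAE, test MAE).
results = {}
for depth in [1, 3, 6, None]:
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    results[depth] = (
        mean_absolute_error(y_train, tree.predict(X_train)),
        mean_absolute_error(y_test, tree.predict(X_test)),
    )
    print(depth, results[depth])
```

Comparing the two numbers in each pair is the point: a gap that widens with depth suggests the tree is memorizing the training data.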
Try adding a feature for number of bedrooms. What happens to the model’s performance on the training set? What happens to the model’s performance on the test set?