Exercise 11: Bayesian networks among other models

The goal of this exercise is to practice with Bayesian networks and compare their predictions against those of other models seen so far.

Getting started

We’ll use the breast cancer data from the UC Irvine Machine Learning Repository. The dataset has 10 columns, which are described as follows:

| Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
|---------------|------|------|-------------|-------------|-------|----------------|
| Class | Target | Binary | | no-recurrence-events, recurrence-events | | no |
| age | Feature | Categorical | Age | 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99 | years | no |
| menopause | Feature | Categorical | | lt40, ge40, premeno | | no |
| tumor-size | Feature | Categorical | | 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59 | | no |
| inv-nodes | Feature | Categorical | | 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39 | | no |
| node-caps | Feature | Binary | | yes, no | | yes |
| deg-malig | Feature | Integer | | 1, 2, 3 | | no |
| breast | Feature | Binary | | left, right | | no |
| breast-quad | Feature | Categorical | | left-up, left-low, right-up, right-low, central | | yes |
| irradiat | Feature | Binary | | yes, no | | no |

Note that our target variable is “Class”: what we are trying to predict is whether a given case of breast cancer will present recurrence events.

Installing sorobn and importing libraries

Since we will be training a Bayesian network model, we will have to use a Python library for that (believe us: you wouldn’t want to implement all the algorithms yourself in a single exercise).

We’ll be using the sorobn library, which can be installed by typing the following in a terminal:

pip install sorobn graphviz

Now we are ready to import our libraries:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import sorobn as hh

Loading and wrangling data

  • Download the data from the website.
  • Get the file breast-cancer.data and put it in the data folder.
  • Load it in Python using pd.read_csv().

Since this file does not contain the column names, we are specifying them “by hand” when loading:

column_names = ['class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
breast = pd.read_csv("data/breast-cancer.data", names=column_names)

As you can observe in the data description, two variables of our dataset have missing data: node-caps and breast-quad. Since we haven’t yet seen how to deal with missing values (which can cause different problems in different models), we’ll drop these variables for now. (Maybe you will want to include them later to see what happens.)

For that, you can run:

breast = breast.drop(['node-caps','breast-quad'], axis=1)

Part 1: Supervised learning with previous models

Before training our Bayesian network, let’s see how well we can predict with models we have already used.

To make things more reusable, set variables indicating which columns are the target and which are the features. For example, since our target is the class column and our features are all the other ones, you can set:

target = 'class'
features = [i for i in breast.columns if i != target]

Part 1a: One-hot encoding of categorical variables

Our dataset consists of categorical variables, and we haven’t yet seen exactly how to use them as features with the sklearn library.

We will have to one-hot encode all of these features using a function available in the pandas library.

Execute the following code and then observe what happened:

breast_onehot = pd.get_dummies(breast, columns=features)

Answer: what are the feature columns now? How many are there in total?
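
To inspect what changed, you can look at the columns of the new dataframe (a quick sketch):

print(breast_onehot.columns.tolist())  # the new column names after one-hot encoding
print(len(breast_onehot.columns))      # total number of columns (features plus the target)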

Part 1b: Train-test split and model fit

With our features one-hot encoded, we are ready for the train-test split. Split the dataset into 50% training data and 50% testing data; a sketch follows.
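
A minimal sketch of the split (the random_state value is arbitrary; it only makes the split reproducible):

# 50/50 split; stratifying on the target keeps the class proportions similar in both halves
train, test = train_test_split(breast_onehot, test_size=0.5, random_state=0, stratify=breast_onehot[target])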

Now, fit any model we have seen previously using the training data. Remember not to use the target column as a feature.

Indicate which model you are using. Suggestions are decision trees, random forests, or logistic regression.
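
For example, with a decision tree (a sketch; note that after one-hot encoding the feature columns are no longer the ones in the features list):

# the one-hot encoded feature columns: everything except the target
features_onehot = [c for c in breast_onehot.columns if c != target]

model = DecisionTreeClassifier(random_state=0)  # any classifier seen so far would do here
model.fit(train[features_onehot], train[target])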

Part 1c: Accuracy metrics

Now, let’s evaluate the accuracy of our model on our test dataset.

  • Make predictions with your model on the test dataset. For that, use the model’s predict method, and remember to pass only the feature variables to it (not the target).
  • Then use the accuracy_score function from sklearn to check your predictions against the true values of our target variable (in our case, 'class'). What accuracy do you get? (See the sketch after this list.)
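
Putting both steps together (a sketch, reusing the model, test, and features_onehot names from Part 1b):

test_predictions = model.predict(test[features_onehot])  # pass only the feature columns
accuracy_score(y_true=test[target], y_pred=test_predictions)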

Part 2: The Bayesian network model

We will now check how well a Bayesian network model will perform.

However, it is very important to know that the reason for using a Bayesian network is not just predictive performance, but also:

  1. Being able to obtain probabilities for the outcomes (predictions), and
  2. Being able to calculate these probabilities even when some features have missing data.

Part 2a: Learning the structure

First, we will have to learn the Bayesian network structure. For that, run:

structure = hh.structure.chow_liu(breast)
bn = hh.BayesNet(*structure)

This means we are using the Chow-Liu algorithm to find (one of) the best possible network structures given the data.

You can visualize the network structure by running:

dot = bn.graphviz()
print(dot.source)

The print above will output the graph structure in Graphviz dot syntax. You can copy it and paste it into Quarto, which will then render the graph; see the example below. For details on how to render graphs in Quarto, check this documentation.
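
For reference, a dot block in a Quarto document looks roughly like this (the edges below are placeholders; paste the actual output of print(dot.source) instead):

```{dot}
digraph {
  age -> menopause
  // ... the remaining edges from the printed source
}
```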

Show the resulting graph and reflect on it.

A big question is: does this graph really reflect knowledge of how one variable would cause another? (Clearly not.) This happens because structure learning algorithms only try to find conditional independencies between the variables. Once we have these, multiple graph structures are possible, and some of them really make no sense in a “causal” way. Remember what we said in class: Bayesian networks may show causality, but not always; sometimes they are just encoding statistical relationships between variables (which may nevertheless be useful enough).

Part 2b: Learning the parameters

Once we have our network structure, we can obtain the conditional probability distributions for our variables using data. Just run:

bn = bn.fit(breast)

Now we are ready for queries, inferences, predictions, etc.

Part 3: Queries and predictions with a Bayesian network

There are lots of operations we can perform with a Bayesian network, and some of them are explored in the tutorial for the sorobn library. For now, we’ll only make queries and check the resulting probabilities of our target variable.

Part 3a: Queries with few features

Just for testing, try querying the probabilities of class using knowledge of only one of the variables. For example, you may want to know the probability of class given that age is 30-39:

bn.query('class', event={'age': '30-39'})

Try changing some of the features, or even changing the target variable (instead of class, try checking probabilities on deg-malig, or breast, given information on other variables), and see which probabilities are returned. Show some of your examples here; a couple of illustrative queries follow.
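
For instance (illustrative queries; the exact numbers depend on the fitted network, and note that deg-malig values are integers if pandas read them as such):

bn.query('deg-malig', event={'breast': 'left'})               # probabilities of deg-malig given the affected breast
bn.query('class', event={'deg-malig': 3, 'irradiat': 'yes'})  # evidence on two variables at once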

Part 3b: Predicting with all the features

Now, let’s make queries using all our feature variables. However, note that since the network identifies conditional independencies between these variables, it may happen that knowing one variable adds nothing to our prediction. (Optional: can you point out which variables these are, given our network?)

In any case, what we can do is convert our data rows to dictionaries and pass them to our queries. See:

predictions = []
for i in breast.to_dict('records'):   # convert each row of the dataset to a dictionary
  del i[target]                       # drop the target variable from the dict (only the features remain)
  probs = bn.query(target, event=i)   # get the probabilities of the target variable given our features
  predictions.append(probs.idxmax())  # keep the most likely value (the one with the highest probability)

Part 3c: Accuracy metrics

Now we are ready to compare our predictions with the true values of the target variable. Calculate the accuracy by using:

accuracy_score(
    y_true=breast[target],
    y_pred=predictions
)

What accuracy do you get? Is it better than with our previous model?

Part 4: Tweaking, visualizing, reflecting…

There are still interesting things you can try: change the prediction model used previously and compare it again with our Bayesian network, or at least tweak some of the parameters of the model you used.

However, as we noted above, the Bayesian network has the advantage of giving us probabilities for its predictions. This leads us to ask: how “good” are these probabilities? Are we really certain of our predictions? How can we check that? One idea is sketched below.
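
One common way to check this is a calibration (reliability) check: bin the predicted probabilities and compare each bin’s average predicted probability with the observed frequency of the event. A minimal sketch, assuming you collected a hypothetical list probs_recurrence with P(class = 'recurrence-events') for each row inside the loop of Part 3b (e.g., with probs['recurrence-events']):

from sklearn.calibration import calibration_curve

# 1 where recurrence actually happened, 0 otherwise
y_true = (breast[target] == 'recurrence-events').astype(int)

# probs_recurrence is the hypothetical list of predicted probabilities described above
frac_pos, mean_pred = calibration_curve(y_true, probs_recurrence, n_bins=5)
print(mean_pred)  # average predicted probability in each bin
print(frac_pos)   # observed fraction of recurrence events in each bin; ideally close to mean_pred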