Exercise 11: Bayesian networks among other models

The goal of this exercise is to practice with Bayesian networks and compare their predictions against those of other models seen so far.

Getting started

We’ll use the breast cancer data from the UC Irvine Machine Learning Repository. The dataset has 10 columns, which are described as follows:

| Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
|---------------|------|------|-------------|-------------|-------|----------------|
| Class | Target | Binary | | no-recurrence-events, recurrence-events | | no |
| age | Feature | Categorical | Age | 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99 | years | no |
| menopause | Feature | Categorical | | lt40, ge40, premeno | | no |
| tumor-size | Feature | Categorical | | 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59 | | no |
| inv-nodes | Feature | Categorical | | 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39 | | no |
| node-caps | Feature | Binary | | yes, no | | yes |
| deg-malig | Feature | Integer | | 1, 2, 3 | | no |
| breast | Feature | Binary | | left, right | | no |
| breast-quad | Feature | Categorical | | left-up, left-low, right-up, right-low, central | | yes |
| irradiat | Feature | Binary | | yes, no | | no |

Note that our target variable is “Class”: what we are trying to predict is whether a given case of breast cancer will present recurrence events.

Installing sorobn and importing libraries

Since we will be training a Bayesian network model, we will have to use a Python library for that (believe us: you wouldn’t want to implement all the algorithms yourself in a single exercise).

We’ll be using the sorobn library, which can be installed by typing the following in a terminal:

pip install sorobn graphviz

Now we are ready to import our libraries:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd
import sorobn as hh

Loading and wrangling data

  • Download the data from the website.
  • Get the file breast-cancer.data and put it in the data folder.
  • Load it in Python using pd.read_csv().

Since this file does not contain the column names, we are specifying them “by hand” when loading:

column_names = ['class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
breast = pd.read_csv("data/breast-cancer.data", names=column_names)

As you can observe in the data description, two variables of our dataset have missing data: node-caps and breast-quad. Since we haven’t yet seen how to deal with missing values (which can cause different problems in different models), we’ll drop these variables for now. (Maybe you will want to include them later to see what happens.)

For that, you can run:

breast = breast.drop(['node-caps','breast-quad'], axis=1)

Part 1: Supervised learning with previous models

Before training our Bayesian network, let’s see how well we can predict with models we have already used.

To make things more reusable, set variables indicating which columns are the target and which are the features. For example, since our target is the class column and our features are all the other ones, you can set:

target = 'class'
features = [i for i in breast.columns if i != target]

Part 1a: One-hot encoding of categorical variables

Our dataset consists of categorical variables, and we haven’t yet seen exactly how to use them as features with the sklearn library.

We will have to one-hot encode all of these features using a function available in the pandas library.

Execute the following code and then observe what happened:

breast_onehot = pd.get_dummies(breast, columns=features)

Answer: what are the feature columns now? How many are there in total?
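
To inspect what changed, you can look at the columns of the new dataframe (a quick sketch):

print(breast_onehot.columns.tolist())  # the new column names after one-hot encoding
print(len(breast_onehot.columns))      # total number of columns (features plus the target)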

Part 1b: Train-test split and model fit

With our features one-hot encoded, we are ready for the train-test split. Split the dataset into 50% training data and 50% testing data; a sketch follows.
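
A minimal sketch of the split (the random_state value is arbitrary; it only makes the split reproducible):

# 50/50 split; stratifying on the target keeps the class proportions similar in both halves
train, test = train_test_split(breast_onehot, test_size=0.5, random_state=0, stratify=breast_onehot[target])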

Now, fit any model we have seen previously using the training data. Remember not to use the target column as a feature.

Indicate which model you are using. Suggestions are decision trees, random forests, or logistic regression.
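
For example, with a decision tree (a sketch; note that after one-hot encoding the feature columns are no longer the ones in the features list):

# the one-hot encoded feature columns: everything except the target
features_onehot = [c for c in breast_onehot.columns if c != target]

model = DecisionTreeClassifier(random_state=0)  # any classifier seen so far would do here
model.fit(train[features_onehot], train[target])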

Part 1c: Accuracy metrics

Now, let’s evaluate the accuracy of our model on our test dataset.

  • Make predictions with your model on the test dataset. For that, use the model’s predict method, and remember to pass only the feature variables to it (not the target).
  • Then use the accuracy_score function from sklearn to check your predictions against the true values of our target variable (in our case, 'class'). What accuracy do you get? (See the sketch after this list.)
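
Putting both steps together (a sketch, reusing the model, test, and features_onehot names from Part 1b):

test_predictions = model.predict(test[features_onehot])  # pass only the feature columns
accuracy_score(y_true=test[target], y_pred=test_predictions)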

Part 2: The Bayesian network model

We will now check how well a Bayesian network model will perform.

However, it is very important to know that the reason for using a Bayesian network is not just predictive performance, but also:

  1. Being able to obtain probabilities for the outcomes (predictions), and
  2. Being able to calculate these probabilities even when some features have missing data.

Part 2a: Learning the structure

First, we will have to learn the Bayesian network structure. For that, run:

structure = hh.structure.chow_liu(breast)
bn = hh.BayesNet(*structure)

This means we are using the Chow-Liu algorithm to find (one of) the best possible network structures given the data.

You can visualize the network structure by running:

dot = bn.graphviz()
print(dot.source)

The print above will output the graph structure in Graphviz dot syntax. You can copy it and paste it into Quarto, which will then render the graph; see the example below. For details on how to render graphs in Quarto, check this documentation.
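
For reference, a dot block in a Quarto document looks roughly like this (the edges below are placeholders; paste the actual output of print(dot.source) instead):

```{dot}
digraph {
  age -> menopause
  // ... the remaining edges from the printed source
}
```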

Show the resulting graph and reflect on it.

A big question is: does this graph really reflect knowledge of how one variable would cause another? (Clearly not.) This happens because structure learning algorithms only try to find conditional independencies between the variables. Once we have these, multiple graph structures are possible, and some of them really make no sense in a “causal” way. Remember what we said in class: Bayesian networks may show causality, but not always; sometimes they are just encoding statistical relationships between variables (which may nevertheless be useful enough).

Part 2b: Learning the parameters

Once we have our network structure, we can obtain the conditional probability distributions for our variables using data. Just run:

bn = bn.fit(breast)

Now we are ready for queries, inferences, predictions, etc.

Part 3: Queries and predictions with a Bayesian network

There are lots of operations we can perform with a Bayesian network, and some of them are explored in the tutorial for the sorobn library. For now, we’ll only make queries and check the resulting probabilities of our target variable.

Part 3a: Queries with few features

Just for testing, try querying the probabilities of class using knowledge of only one of the variables. For example, you may want to know the probability of class given that age is 30-39:

bn.query('class', event={'age': '30-39'})

Try changing some of the features, or even changing the target variable (instead of class, try checking probabilities on deg-malig, or breast, given information on other variables), and see which probabilities are returned. Show some of your examples here; a couple of illustrative queries follow.
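
For instance (illustrative queries; the exact numbers depend on the fitted network, and note that deg-malig values are integers if pandas read them as such):

bn.query('deg-malig', event={'breast': 'left'})               # probabilities of deg-malig given the affected breast
bn.query('class', event={'deg-malig': 3, 'irradiat': 'yes'})  # evidence on two variables at once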

Part 3b: Predicting with all the features

Now, let’s make queries using all our feature variables. However, note that since the network identifies conditional independencies between these variables, it may happen that knowing one variable adds nothing to our prediction. (Optional: can you point out which variables these are, given our network?)

In any case, what we can do is convert our data rows to dictionaries and pass them to our queries. See:

predictions = []
for i in breast.to_dict('records'):   # convert each row of the dataset to a dictionary
  del i[target]                       # drop the target variable from the dict (only the features remain)
  probs = bn.query(target, event=i)   # get the probabilities of the target variable given our features
  predictions.append(probs.idxmax())  # keep the most likely value (the one with the highest probability)

Part 3c: Accuracy metrics

Now we are ready to compare our predictions with the true values of the target variable. Calculate the accuracy by using:

accuracy_score(
    y_true=breast[target],
    y_pred=predictions
)

What accuracy do you get? Is it better than with our previous model?

Part 4: Tweaking, visualizing, reflecting…

There are still interesting things you can try: change the prediction model used previously and compare it again with our Bayesian network, or at least tweak some of the parameters of the model you used.

However, as we noted above, the Bayesian network has the advantage of giving us probabilities for its predictions. This leads us to ask: how “good” are these probabilities? Are we really certain of our predictions? How can we check that? One idea is sketched below.
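
One common way to check this is a calibration (reliability) check: bin the predicted probabilities and compare each bin’s average predicted probability with the observed frequency of the event. A minimal sketch, assuming you collected a hypothetical list probs_recurrence with P(class = 'recurrence-events') for each row inside the loop of Part 3b (e.g., with probs['recurrence-events']):

from sklearn.calibration import calibration_curve

# 1 where recurrence actually happened, 0 otherwise
y_true = (breast[target] == 'recurrence-events').astype(int)

# probs_recurrence is the hypothetical list of predicted probabilities described above
frac_pos, mean_pred = calibration_curve(y_true, probs_recurrence, n_bins=5)
print(mean_pred)  # average predicted probability in each bin
print(frac_pos)   # observed fraction of recurrence events in each bin; ideally close to mean_pred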