from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, mean_absolute_error, mean_absolute_percentage_error
from sklearn.metrics import accuracy_score
import pandas as pd
import sorobn as hh
Exercise 11: Bayesian networks among other models
The goal of this exercise is to practice with Bayesian networks and compare their predictions against those of other models seen so far.
Getting started
We’ll use the breast cancer data from the UC Irvine Machine Learning Repository. The dataset has 10 columns, which are described as follows:
Variable Name | Role | Type | Demographic | Description | Units | Missing Values |
---|---|---|---|---|---|---|
Class | Target | Binary | | no-recurrence-events, recurrence-events | | no |
age | Feature | Categorical | Age | 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70-79, 80-89, 90-99 | years | no |
menopause | Feature | Categorical | | lt40, ge40, premeno | | no |
tumor-size | Feature | Categorical | | 0-4, 5-9, 10-14, 15-19, 20-24, 25-29, 30-34, 35-39, 40-44, 45-49, 50-54, 55-59 | | no |
inv-nodes | Feature | Categorical | | 0-2, 3-5, 6-8, 9-11, 12-14, 15-17, 18-20, 21-23, 24-26, 27-29, 30-32, 33-35, 36-39 | | no |
node-caps | Feature | Binary | | yes, no | | yes |
deg-malig | Feature | Integer | | 1, 2, 3 | | no |
breast | Feature | Binary | | left, right | | no |
breast-quad | Feature | Categorical | | left-up, left-low, right-up, right-low, central | | yes |
irradiat | Feature | Binary | | yes, no | | no |
Note that our target variable is “Class”: what we are trying to predict is whether a case of breast cancer will present recurrence events.
Installing sorobn and importing libraries
Since we will be training a Bayesian network model, we will have to use a Python library for that (believe us: you wouldn’t want to implement all the algorithms in just one exercise).
We’ll be using the sorobn library, which can be installed by typing in the terminal:
pip install sorobn graphviz
Now we are ready to import our libraries:
Loading and wrangling data
- Download the data from the website.
- Get the file breast-cancer.data and put it in the data folder.
- Load it in Python using pd.read_csv().
Since this file does not contain the column names, we are specifying them “by hand” when loading:
column_names = ['class', 'age', 'menopause', 'tumor-size', 'inv-nodes', 'node-caps', 'deg-malig', 'breast', 'breast-quad', 'irradiat']
breast = pd.read_csv("data/breast-cancer.data", names=column_names)
As you can observe in the data description, two variables of our dataset have missing data: node-caps and breast-quad. Since we have not yet seen how to deal with such situations (which can cause different problems in different models), we’ll drop these variables for now. (You may want to include them later to see what happens.)
For that, you can run:
breast = breast.drop(['node-caps', 'breast-quad'], axis=1)
Part 1: Supervised learning with previous models
Before training our Bayesian network, let’s see how we can predict with models we have already used.
To make things more reusable, set variables indicating which columns are our target and our features. For example, since our target is the class column and our features are all the other ones, you can set:
target = 'class'
features = [i for i in breast.columns if i != target]
Part 1a: One-hot encoding of categorical variables
Our dataset consists of categorical variables, and we haven’t seen exactly how to use them as features with the sklearn library.
We will have to one-hot encode all of these features using a function available in the pandas library.
Execute the following code and then observe what happened:
breast_onehot = pd.get_dummies(breast, columns=features)
Answer:
- What are the feature columns now? How many are there in total?
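To see what one-hot encoding does before looking at the full dataset, here is a minimal sketch on a small made-up table. The rows and the names toy, toy_features and toy_onehot are invented for illustration and do not come from breast-cancer.data.

```python
import pandas as pd

# Toy data, invented for illustration only (not rows from the real file).
toy = pd.DataFrame({
    "class": ["no-recurrence-events", "recurrence-events", "no-recurrence-events"],
    "age": ["30-39", "40-49", "40-49"],
    "breast": ["left", "right", "left"],
})

toy_features = [c for c in toy.columns if c != "class"]

# Each listed feature is replaced by one indicator column per observed category,
# named "<feature>_<category>"; the target column is left untouched.
toy_onehot = pd.get_dummies(toy, columns=toy_features)

print(list(toy_onehot.columns))
```

Two features with two categories each become four indicator columns, so the number of feature columns depends on how many distinct values each variable takes in the data.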
Part 1b: Train-test split and model fit
With our features one-hot encoded, we are ready to split the data into training and test sets. Split the dataset into 50% training data and 50% testing data.
Now, fit any model we have seen previously using the train data. Remember not to consider the target column as a feature.
Indicate which model you are using. Suggestions are decision trees, random forests, or logistic regression.
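As a rough sketch of these two steps, the example below splits a small made-up one-hot encoded table in half and fits a decision tree on the training part. The toy rows, the name onehot_features and the random_state values are assumptions made only for this illustration. Any of the suggested models could be used in place of the tree.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for breast_onehot, invented for illustration only.
toy = pd.DataFrame({
    "class": ["no", "yes"] * 10,
    "age": ["30-39", "40-49", "50-59", "40-49"] * 5,
    "irradiat": ["no", "yes", "no", "no"] * 5,
})
toy_onehot = pd.get_dummies(toy, columns=["age", "irradiat"])

target = "class"
# Every column except the target is a feature.
onehot_features = [c for c in toy_onehot.columns if c != target]

# 50% training data, 50% testing data. random_state is an arbitrary choice
# made here only so that the split is reproducible.
train, test = train_test_split(toy_onehot, test_size=0.5, random_state=0)

# Fit on the training rows only, passing features and target separately.
model = DecisionTreeClassifier(random_state=0)
model.fit(train[onehot_features], train[target])
```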
Part 1c: Accuracy metrics
Now, let’s evaluate the accuracy of our model on our test dataset.
- Make predictions with your model using the test dataset. For that, use the predict method of the model, and remember to pass only the feature variables to it (not the target).
- Now, use the accuracy_score function from sklearn to check our predictions against the true values of our target variable (in our case, 'class'). What accuracy did you get?
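To make the metric concrete, the sketch below computes accuracy by hand on a few made-up labels and checks that accuracy_score gives the same number. The labels are invented for illustration only.

```python
from sklearn.metrics import accuracy_score

# Made-up labels, for illustration only.
y_true = ["no", "no", "yes", "no", "yes"]
y_pred = ["no", "yes", "yes", "no", "no"]

# Accuracy is the fraction of predictions that match the true labels.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sklearn_accuracy = accuracy_score(y_true=y_true, y_pred=y_pred)

print(manual_accuracy, sklearn_accuracy)  # 3 matches out of 5 in both cases
```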
Part 2: The Bayesian network model
We will now check how well a Bayesian network model will perform.
However, it is very important to know that the reason for using a Bayesian network is not just increasing performance, but also:
- Being able to know probabilities on the outcomes (predictions), and
- Being able to calculate these probabilities even when some features have missing data.
Part 2a: Learning the structure
First, we will have to learn the Bayesian network structure. For that, run:
structure = hh.structure.chow_liu(breast)
bn = hh.BayesNet(*structure)
This means we are using the Chow-Liu algorithm to find (one of) the best tree-shaped network structures given the data.
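The idea behind Chow-Liu can be sketched in plain Python: score every pair of variables by their empirical mutual information, then keep a maximum-weight spanning tree over those scores. The functions mutual_information and chow_liu_edges and the toy columns below are made up for this illustration. This is not sorobn's implementation, and it leaves out the final step of orienting the edges away from a chosen root.

```python
from collections import Counter
from itertools import combinations
from math import log


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two categorical columns."""
    n = len(xs)
    count_x, count_y, count_xy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x, y) * log(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


def chow_liu_edges(columns):
    """Undirected edges of a maximum spanning tree weighted by mutual information."""
    pairs = sorted(
        combinations(columns, 2),
        key=lambda pair: mutual_information(columns[pair[0]], columns[pair[1]]),
        reverse=True,
    )
    parent = {name: name for name in columns}  # naive union-find

    def find(name):
        while parent[name] != name:
            name = parent[name]
        return name

    edges = []
    for a, b in pairs:  # Kruskal: take the strongest pair that creates no cycle
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b
            edges.append((a, b))
    return edges


# Toy columns, invented for illustration: b copies a, c is only loosely related.
toy = {
    "a": [0, 0, 1, 1, 0, 1, 0, 1],
    "b": [0, 0, 1, 1, 0, 1, 0, 1],
    "c": [0, 1, 0, 1, 0, 1, 0, 1],
}
edges = chow_liu_edges(toy)
print(edges)
```

With three variables the tree has two edges, and the strongly dependent pair (a, b) is always among them.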
You can visualize the network structure by running:
dot = bn.graphviz()
print(dot.source)
The print above will output the graph structure in Graphviz dot syntax. You can copy it and paste it into Quarto itself, which will then render the graph. For details on how to render graphs in Quarto, check this documentation.
Show the resulting graph and reflect on it.
A big question is: does this graph really reflect some knowledge of how one variable would cause another? (Clearly not.) This happens because network structure learning algorithms only try to find conditional independencies between the variables. However, once we have these, multiple graph structures are possible, and some of them would not make sense in a “causal” way. Remember what we said in class: Bayesian networks may show causality, but not always. Sometimes they are just encoding statistical relationships between variables (which may be useful enough, however).
Part 2b: Learning the parameters
Once we have our network structure, we can obtain the conditional probability distributions for our variables using data. Just run:
bn = bn.fit(breast)
Now we are ready for queries, inferences, predictions, etc.
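As an illustration of what a conditional probability distribution estimated from data looks like, the sketch below counts relative frequencies of a child variable within each value of its parent. The toy rows and the edge irradiat -> class are assumptions made for this example only, and the library's own fitting routine may differ in its details.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "irradiat": ["no", "no", "no", "yes", "yes", "no"],
    "class": [
        "no-recurrence-events", "no-recurrence-events", "recurrence-events",
        "recurrence-events", "recurrence-events", "no-recurrence-events",
    ],
})

# Assuming a hypothetical edge irradiat -> class, the table P(class | irradiat)
# is the relative frequency of each class value within each irradiat value.
# normalize="index" makes every row sum to 1.
cpt = pd.crosstab(toy["irradiat"], toy["class"], normalize="index")
print(cpt)
```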
Part 3: Queries and predictions with a Bayesian network
There are lots of operations we can perform with a Bayesian network, and some of them are explored in the tutorial for the sorobn library. For now, we’ll only make queries and check the resulting probabilities of our target variable.
Part 3a: Queries with few features
Just for testing, try making a query on the probabilities of class using knowledge of just one of the variables. For example, you may want to know the probability of class given that age is 30-39:
bn.query('class', event={'age': '30-39'})
Try changing some of the features, or even changing the target variable (instead of class, try checking probabilities on deg-malig or breast, given information on other variables), and see which probabilities are returned. Show some examples here.
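To see why the target of a query can be any variable, here is a hand-computed sketch on a hypothetical two-node network age -> class. All probabilities below are made up, and query_age_given_class is an illustrative helper, not part of sorobn. It answers a query "against" the direction of the edge using Bayes' rule: multiply prior and likelihood, then normalize.

```python
# Hypothetical two-node network age -> class with made-up probabilities.
p_age = {"30-39": 0.4, "40-49": 0.6}
p_class_given_age = {
    "30-39": {"no-recurrence-events": 0.7, "recurrence-events": 0.3},
    "40-49": {"no-recurrence-events": 0.8, "recurrence-events": 0.2},
}


def query_age_given_class(observed_class):
    """P(age | class = observed_class): joint probability, then normalization."""
    joint = {
        age: p_age[age] * p_class_given_age[age][observed_class]
        for age in p_age
    }
    total = sum(joint.values())  # this is P(class = observed_class)
    return {age: value / total for age, value in joint.items()}


posterior = query_age_given_class("recurrence-events")
print(posterior)
```

Here both joint terms equal 0.12, so observing a recurrence leaves the two age groups equally likely even though their prior probabilities differ.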
Part 3b: Predicting with all the features
Now, let’s make queries using all our feature variables. However, keep in mind that, since the network encodes conditional independencies between these variables, it may happen that knowing one variable adds nothing to our prediction. (Optional: can you point out which variables these are, given our network?)
In any case, what we can do is convert our data rows to dictionaries and pass them to our queries. See:
predictions = []
for i in breast.to_dict('records'):  # converting our database entries to dictionaries
    del i[target]  # dropping the target variable from the dict (only features remain)
    probs = bn.query(target, event=i)  # getting probabilities on the target variable given our features
    # add to the list the most likely value (the one with the highest probability)
    predictions.append(probs.idxmax())
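The loop above relies on the query result behaving like a pandas Series indexed by the possible values of the target, which is what the use of idxmax suggests. The sketch below shows that last step in isolation on a made-up result. The probabilities and the names predicted_label and confidence are assumptions for illustration.

```python
import pandas as pd

# Stand-in for the result of one query; the probabilities are made up.
probs = pd.Series({"no-recurrence-events": 0.72, "recurrence-events": 0.28})

predicted_label = probs.idxmax()  # label with the highest probability
confidence = probs.max()          # probability attached to that label

print(predicted_label, confidence)
```

Keeping the probability alongside the label preserves the information that a plain class prediction throws away.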
Part 3c: Accuracy metrics
Now we are ready to compare our predictions with the true values of the target variable. Calculate the accuracy by using:
accuracy_score(
    y_true=breast[target],
    y_pred=predictions
)
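What accuracy do you get? Is it better than our previous model? Keep in mind that this accuracy is computed on the same rows the network was fitted on, whereas the previous model was scored on a held-out test set, so the two numbers are not directly comparable.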
Part 4: Tweaking, visualizing, reflecting…
There are still interesting things you can try: changing the prediction model used previously and comparing it again with our Bayesian network. Or at least you can try tweaking some of the parameters of the model used.
However, as we noted above, the Bayesian network has the advantage of giving us probabilities for our predictions. This leads us to ask: how “good” are these probabilities? Are we really certain of our predictions? How can we check that?