Tool Series: TensorFlow Decision Forests (3): Visualization using dtreeviz

Introduction

Previous tutorials demonstrated how to prepare data and how to train and evaluate TensorFlow Decision Forests (TF-DF) classifiers and regressors (Random Forest, Gradient Boosted Trees, and CART). You also learned how to use the built-in plot_model_in_colab() function to visualize trees and how to display feature importance measures.
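As a quick refresher, that built-in plotter can be invoked as follows (a minimal sketch, assuming a trained TF-DF model named model from that tutorial):

import tensorflow_decision_forests as tfdf

# Assumes `model` is a TF-DF model trained as in the previous tutorial.
# Draw the first tree of the forest, truncated to a maximum depth of 3.
tfdf.model_plotter.plot_model_in_colab(model, tree_idx=0, max_depth=3)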

The goal of this tutorial is to explain classifier and regressor decision trees in more depth through visualization. We'll look at a detailed illustration of the tree structure and a depiction of how a decision tree partitions the feature space to make decisions. Tree structure diagrams help us understand the behavior of the model, and feature space diagrams help us understand the data by showing the relationship between features and target variables.

The visualization library we will use is called dtreeviz, and for consistency we will reuse the penguin and abalone data from the beginner tutorial.

In this tutorial you will learn how to:

  • Show the structure of decision trees in a TF-DF forest
  • Change the size and style of dtreeviz tree diagrams
  • Plot leaf information, such as the number of instances in each leaf, the distribution of target values in each leaf, and various statistics about the leaves
  • Trace how a tree interprets a specific instance, showing the path from the root to the leaf that makes the prediction
  • Print an English explanation of how a tree interprets an example
  • View 1D and 2D feature spaces to see how the model divides them into regions of similar instances

Setup

Install TF-DF and dtreeviz

# Install the tensorflow_decision_forests library
!pip install -q -U tensorflow_decision_forests
# Install the dtreeviz library
!pip install -q -U dtreeviz

Import libraries


import tensorflow_decision_forests as tfdf
import tensorflow as tf

import os
import numpy as np
import pandas as pd
import math

import dtreeviz

from matplotlib import pyplot as plt
from IPython import display

# Avoid "Arial font not found" warnings
import logging
logging.getLogger('matplotlib.font_manager').setLevel(level=logging.CRITICAL)

display.set_matplotlib_formats('retina')  # generate high-resolution figures

np.random.seed(1234)  # for reproducible figures/data interpretation in this tutorial

2023-03-07 12:10:56.998585: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-03-07 12:10:56.998704: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-03-07 12:10:56.998714: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/tmpfs/tmp/ipykernel_9236/31193553.py:20: DeprecationWarning: `set_matplotlib_formats` is deprecated since IPython 7.23, directly use `matplotlib_inline.backend_inline.set_matplotlib_formats()`
# Print the library versions
tfdf.__version__, dtreeviz.__version__  # we expect dtreeviz >= 2.2.0
('1.2.0', '2.2.0')

For convenience, we need to define a function to divide the data set into a training set and a test set:

# Define a function split_dataset that splits a pandas dataframe into two parts,
# typically a training set and a test set.
# Using a fixed random seed ensures we get the same split every time, so that the
# descriptions in this tutorial match the generated images.

def split_dataset(dataset, test_ratio=0.30, seed=1234):
    """
    Split a pandas dataframe into two parts, typically a training set and a test set.

    Args:
        dataset: the dataset to split, as a pandas dataframe
        test_ratio: fraction of rows to put in the test set, default 0.30
        seed: random seed, default 1234

    Returns:
        The training set and the test set, both pandas dataframes.
    """

    # Set the random seed
    np.random.seed(seed)

    # Generate a random array the same length as the dataset, with values in [0, 1).
    # Positions where the random value is less than test_ratio become True (test rows).
    test_indices = np.random.rand(len(dataset)) < test_ratio

    # Return the training set (~test_indices) and the test set (test_indices)
    return dataset[~test_indices], dataset[test_indices]

Visualizing a classification tree

Using the penguin data, let's build a classifier to predict the species (Adelie, Gentoo, or Chinstrap) from the other 7 columns. We can then use dtreeviz to display the tree and interrogate the model to understand how it makes decisions and to understand our data.

Load, clean and prepare data

As with the beginner tutorial, let's start by downloading the penguin data and converting it into a pandas dataframe.

# Download the penguin dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/palmer_penguins/penguins.csv -O /tmp/penguins.csv

# Load the dataset into a Pandas dataframe
df_penguins = pd.read_csv("/tmp/penguins.csv")

# Display the first three rows
df_penguins.head(3)
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 Adelie Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 Adelie Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 Adelie Torgersen 40.3 18.0 195.0 3250.0 female 2007

A quick check shows there are missing values in the dataset:

df_penguins.columns[df_penguins.isna().any()].tolist()
['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g', 'sex']

Rather than filling in missing values, let's simply delete the incomplete rows so we can focus on visualization in this tutorial.

# Drop rows that contain missing values
df_penguins = df_penguins.dropna() # e.g., 19 rows are missing sex and other fields...

TF-DF requires classification labels to be integers in the range [0, num_labels), so let's convert the label column species from string to integer.

Note: TF-DF supports categorical string input features. You don't need to encode any feature values.

# The name of the classification target label
penguin_label = "species"

# Get all unique values of the label column as a list
classes = list(df_penguins[penguin_label].unique())

# Map each label value to its index in the classes list
df_penguins[penguin_label] = df_penguins[penguin_label].map(classes.index)

# Print the target label name and the corresponding list of classes
print(f"Target '{penguin_label}' classes: {classes}")

# Display the first 3 rows of the dataset
df_penguins.head(3)
Target 'species' classes: ['Adelie', 'Gentoo', 'Chinstrap']
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex year
0 0 Torgersen 39.1 18.7 181.0 3750.0 male 2007
1 0 Torgersen 39.5 17.4 186.0 3800.0 female 2007
2 0 Torgersen 40.3 18.0 195.0 3250.0 female 2007

Now let's split the data into training and test sets in a 70/30 ratio using the convenience function defined above, and convert these dataframes into TensorFlow datasets.

Split the training/test set and train the model

# Split the dataset into training and test sets
train_ds_pd, test_ds_pd = split_dataset(df_penguins)
print(f"{len(train_ds_pd)} examples in training, {len(test_ds_pd)} examples for testing.")

# Convert the dataframes into TensorFlow datasets
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd, label=penguin_label)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(test_ds_pd, label=penguin_label)
243 examples in training, 90 examples for testing.

Train a random forest classifier

# Create a random forest model; verbose=0 silences per-tree training output
# and random_seed=1234 makes the trees reproducible
cmodel = tfdf.keras.RandomForestModel(verbose=0, random_seed=1234)

# Train the model on the training dataset
cmodel.fit(train_ds)
WARNING:tensorflow:From /tmpfs/src/tf_docs_env/lib/python3.9/site-packages/tensorflow/python/autograph/pyct/static_analysis/liveness.py:83: Analyzer.lamba_check (from tensorflow.python.autograph.pyct.static_analysis.liveness) is deprecated and will be removed after 2023-09-23.
Instructions for updating:
Lambda fuctions will be no more assumed to be used in the statement where they are used, or at least in the same block. https://github.com/tensorflow/tensorflow/issues/56089


[INFO 2023-03-07T12:11:06.100795433+00:00 kernel.cc:1214] Loading model from path /tmpfs/tmp/tmpeau3pdt_/model/ with prefix 72ee2781602146e9
[INFO 2023-03-07T12:11:06.113257784+00:00 decision_forest.cc:661] Model loaded with 300 root(s), 4310 node(s), and 7 input feature(s).
[INFO 2023-03-07T12:11:06.113286363+00:00 abstract_model.cc:1311] Engine "RandomForestGeneric" built
[INFO 2023-03-07T12:11:06.113305638+00:00 kernel.cc:1046] Use fast generic engine


WARNING:tensorflow:AutoGraph could not transform <function simple_ml_inference_op_with_handle at 0x7f67957524c0> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: could not get source code
To silence this warning, decorate the function with @tf.autograph.experimental.do_not_convert

<keras.callbacks.History at 0x7f68310ddd90>

Just to verify that everything is working properly, let's check the model's accuracy, which should be around 99%:

# Compile the model with "accuracy" as the evaluation metric
cmodel.compile(metrics=["accuracy"])

# Evaluate on the test dataset and return the results as a dict; verbose=0 hides progress
cmodel.evaluate(test_ds, return_dict=True, verbose=0)
{'loss': 0.0, 'accuracy': 0.9888888597488403}

Yes, the model's accuracy on the test set is high.

Display the decision tree

Now that we have a model, let's select a tree from the random forest and look at its structure. The dtreeviz library requires us to bundle the TF-DF model with the relevant training data; the resulting object can then be interrogated repeatedly.

# Get the feature names used by the penguin model
penguin_features = [f.name for f in cmodel.make_inspector().features()]

# Create a dtreeviz visualization model
# Arguments:
# - cmodel: the trained TF-DF model
# - tree_index: the index of the tree in the forest to visualize
# - X_train: training-set feature data
# - y_train: training-set labels
# - feature_names: list of feature names
# - target_name: name of the target variable
# - class_names: list of class names
viz_cmodel = dtreeviz.model(cmodel,
                            tree_index=3,
                            X_train=train_ds_pd[penguin_features],
                            y_train=train_ds_pd[penguin_label],
                            feature_names=penguin_features,
                            target_name=penguin_label,
                            class_names=classes)

The most commonly used dtreeviz API function is view(), which displays the structure of the tree and the feature distributions of the instances associated with each decision node.

# Display tree 3 at 1.2x scale
viz_cmodel.view(scale=1.2)


The root node of the decision tree tests the flipper_length_mm feature first, with a split value of 206. If a test instance's flipper_length_mm value is less than 206, the tree descends to the left child; if it is greater than or equal to 206, classification proceeds down the right child.

To understand why the model chose to split the training data at flipper_length_mm = 206, let's zoom in on the root node.

# Display only the root node (depth 0) at 1.5x scale
viz_cmodel.view(depth_range_to_display=[0,0], scale=1.5)


We can clearly see that almost all instances to the right of 206 are blue (Gentoo penguins). With a single feature comparison, then, the model separates the training data into a fairly pure Gentoo group and a mixed group. (The model will purify the subgroups further with splits below the root.)
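We can roughly sanity-check this claim on the training dataframe itself (a sketch, not part of the original notebook; the node in tree 3 was fit on a bootstrap sample, so its counts won't match exactly):

# Count species on each side of the flipper_length_mm = 206 split
name = dict(enumerate(classes))  # map integer labels back to species names
right = train_ds_pd[train_ds_pd['flipper_length_mm'] >= 206]
left = train_ds_pd[train_ds_pd['flipper_length_mm'] < 206]
print(right[penguin_label].map(name).value_counts())
print(left[penguin_label].map(name).value_counts())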

Decision trees also have categorical decision nodes, which test subsets of categories rather than simple numeric splits. For example, let's look at the second level of the tree:

# Display only the nodes at depth 1, scaled up by 1.5x
viz_cmodel.view(depth_range_to_display=[1,1], scale=1.5)


The node on the left tests the island feature: if the test instance has island==Dream, classification continues down its right child; for the other two categories, Torgersen and Biscoe, classification continues down its left child. (The bill_length_mm node on the right of this figure is not relevant to this discussion of categorical decision nodes.)

This splitting behavior highlights the goal of decision trees: to partition the feature space into regions where the purity of the target value increases. We will look at the feature space in more detail below.
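A quick cross-tabulation of island against species in the training data (a rough check, not part of the original notebook) shows why an island test is informative:

# Cross-tabulate island against species in the training data
name = dict(enumerate(classes))  # map integer labels back to species names
print(pd.crosstab(train_ds_pd['island'], train_ds_pd[penguin_label].map(name)))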

Decision trees can become very large, and it is not always useful to draw them in their entirety. We can instead look at simplified versions of the tree, parts of the tree, the number of training instances in the various leaves (where predictions are made), and so on. Here is an example where we turn off the fancy decision-node distribution plots and scale the whole image down to 75%:

# Visualize the tree without the fancy per-node distribution plots, at 75% scale
viz_cmodel.view(fancy=False, scale=.75)


We can also use a left-to-right direction, which sometimes results in a smaller plot.

# Display the tree left-to-right ('LR') at 75% scale
viz_cmodel.view(orientation='LR', scale=.75)


If you're not a fan of pie charts, you can also use bar charts.

# Show leaf predictions as horizontal bar charts instead of pie charts, at 75% scale
viz_cmodel.view(leaftype='barh', scale=.75)


Check leaf node statistics

Decision trees make their decisions at the leaf nodes, so it is sometimes useful to focus on the leaves if the entire graph is too large to view at once. Here's how to check the number of training instances grouped into each leaf:

# Plot the number of training instances in each leaf
viz_cmodel.leaf_sizes(figsize=(5,1.5))

Perhaps a more interesting graph shows the proportion of each class of training instance in each leaf. The goal of training is for a leaf to have a single color, because it then represents a "pure" node that can predict that class with high confidence.

# Plot the class distribution of training instances in each leaf
viz_cmodel.ctree_leaf_distributions(figsize=(5,1.5))

We can also zoom in on a specific leaf to view statistics of its instances' features. For example, leaf node 5 contains 31 instances with 24 unique bill_depth_mm values:

# Show feature statistics for the instances in node 5
viz_cmodel.node_stats(node_id=5)
bill_depth_mm bill_length_mm body_mass_g flipper_length_mm island sex year
count 31.0 31.0 31.0 31.0 31 31 31
unique 24.0 28.0 26.0 17.0 1 2 3
top 18.5 39.5 3300.0 185.0 Dream female 2009
freq 4.0 2.0 2.0 4.0 31 19 11

How decision trees classify instances

Now that we understand the structure and content of a decision tree, let's figure out how the classifier makes a decision for a specific instance. By passing an instance (feature vector) x as an argument to the view() function, the function highlights the root-to-leaf path the classifier follows to make a prediction for that instance.


# Select the 20th training example
x = train_ds_pd[penguin_features].iloc[20]

# Highlight the path taken through the tree for instance x
viz_cmodel.view(x=x, scale=.75)


This illustration highlights the tree path and the instance features (island, bill_length_mm, and flipper_length_mm) that are tested.

For very large trees, you can also use the show_just_path argument to see only the path through the tree, rather than the entire tree.

# Show only the path through the tree for instance x, at 75% scale
viz_cmodel.view(x=x, show_just_path=True, scale=.75)


To obtain an English explanation of an instance's classification, use the explain_prediction_path() function, which produces the smallest possible representation.

# Print an explanation of the prediction path for instance x
print(viz_cmodel.explain_prediction_path(x=x))
bill_length_mm < 40.6
flipper_length_mm < 206.0
island in {'Dream'}  

The model tests the bill_length_mm, flipper_length_mm, and island features of x to reach the leaf node, which predicts Adelie.
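As a cross-check (a sketch, not part of the original notebook), we can run the same instance through the full model; note that cmodel.predict averages all 300 trees, not just tree 3 shown above:

# Run instance 20 through the full forest and map the winning class index back to its name
x_ds = tfdf.keras.pd_dataframe_to_tf_dataset(train_ds_pd[penguin_features].iloc[[20]])
probs = cmodel.predict(x_ds, verbose=0)
print(classes[int(np.argmax(probs[0]))])  # expected to be 'Adelie'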

Feature space partitioning

So far we have seen the structure of a tree and how it interprets instances to make decisions, but what exactly do the decision nodes do? Decision trees partition the feature space into groups of observations that share similar target values. Each leaf represents the partition produced by the sequence of feature splits performed from the root down to that leaf. For classification, the goal is for a partition's instances to share the same, or mostly the same, target class value.

If we look back at the structure of the tree, we see that the variable flipper_length_mm is tested by three nodes. The corresponding split values are 189, 206, and 210.5, which means the decision tree divides flipper_length_mm into four regions, which we can illustrate with ctree_feature_space():

# Plot the 1D feature space for flipper_length_mm,
# showing the split lines and the legend
viz_cmodel.ctree_feature_space(features=['flipper_length_mm'],
                               show={'splits','legend'}, figsize=(5,1.5))

(In this single-feature case, the vertical axis has no meaning. To increase visibility, the vertical axis just separates the points representing different target classes into different heights and adds some noise.)

The first split at 206 (tested at the root) separates the training data into an overlapping region of Adelie/Chinstrap penguins and a fairly pure region of Gentoo penguins. A subsequent split at 210.5 further isolates a pure Gentoo region (flipper length greater than 210.5). The decision tree also splits at 189, but the resulting regions are still impure; the tree relies on splits of other variables to separate the "confused" Adelie/Chinstrap penguins. Because we only passed in one feature name, splits on other features are not shown.
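The four regions can be approximated directly on the training dataframe (a sketch, not part of the original notebook; the splits were fit on tree 3's bootstrap sample, so the counts are approximate):

# Bucket flipper_length_mm by the three split values and look at the class mix in each region
name = dict(enumerate(classes))
regions = pd.cut(train_ds_pd['flipper_length_mm'], bins=[-np.inf, 189, 206, 210.5, np.inf])
print(pd.crosstab(regions, train_ds_pd[penguin_label].map(name)))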

Let's look at another feature with more splits, bill_length_mm. Four nodes in the decision tree test this feature, so the feature space is divided into five regions. Note that the model can carve out a pure Adelie region by testing for bill_length_mm less than 40.

# Plot the 1D feature space for bill_length_mm,
# showing the split lines and the legend
viz_cmodel.ctree_feature_space(features=['bill_length_mm'],
                               show={'splits','legend'}, figsize=(5,1.5))

We can also examine how the tree partitions the feature space of two features at once, such as flipper_length_mm and bill_length_mm:

# Plot the 2D feature space for flipper_length_mm and bill_length_mm,
# showing the split lines and the legend
viz_cmodel.ctree_feature_space(features=['flipper_length_mm','bill_length_mm'],
                               show={'splits','legend'}, figsize=(5,5))

The color of a region indicates the class predicted for test instances whose features fall within that region.

By considering two variables at once, the decision tree can create purer (rectangular) regions, resulting in more accurate predictions. For example, the upper-left region contains only Chinstrap penguins.

The purity of the regions varies depending on the variables we choose. Here is another two-dimensional feature-space partition, this time for the body_mass_g and bill_length_mm features, where shading represents uncertainty.

# Plot the 2D feature space for body_mass_g and bill_length_mm,
# showing the split lines and the legend
viz_cmodel.ctree_feature_space(features=['body_mass_g','bill_length_mm'],
                               show={'splits','legend'}, figsize=(5,5))

Only the Adelie regions are relatively pure. The tree relies on other variables to get a better partitioning, as we just saw with the flipper_length_mm vs bill_length_mm space.

Currently, the dtreeviz library cannot visualize classifier feature spaces in more than two dimensions.

By now, you have a good grasp of how to visualize the structure of a decision tree, how the tree partitions the feature space, and how the tree classifies test instances. Now let's turn to regression and see how dtreeviz can visualize regression trees.

Visualizing regression trees

Let's explore the structure of a regression tree using the abalone dataset from the beginner tutorial. As with classification above, we first load and prepare the training data. Given 8 variables, we want to predict the number of rings in an abalone's shell.

Load, clean and prepare data

Using the following code snippet, we can see that all features are numeric except for the Type (sex) variable.

# Download the abalone dataset
!wget -q https://storage.googleapis.com/download.tensorflow.org/data/abalone_raw.csv -O /tmp/abalone.csv

# Read the CSV file into a dataframe
df_abalone = pd.read_csv("/tmp/abalone.csv")

# Display the first 3 rows
df_abalone.head(3)
Type LongestShell Diameter Height WholeWeight ShuckedWeight VisceraWeight ShellWeight Rings
0 M 0.455 0.365 0.095 0.5140 0.2245 0.1010 0.15 15
1 M 0.350 0.265 0.090 0.2255 0.0995 0.0485 0.07 7
2 F 0.530 0.420 0.135 0.6770 0.2565 0.1415 0.21 9

Fortunately, there are no missing data to deal with:

# Check each column for missing values
df_abalone.isna().any()
Type             False
LongestShell     False
Diameter         False
Height           False
WholeWeight      False
ShuckedWeight    False
VisceraWeight    False
ShellWeight      False
Rings            False
dtype: bool

Split the training/test set and train the model

# The regression target label is "Rings"
abalone_label = "Rings"

# Split the dataset into training and test sets (70/30)
df_train_abalone, df_test_abalone = split_dataset(df_abalone)

# Print the number of training and test examples
print(f"{len(df_train_abalone)} examples in training, {len(df_test_abalone)} examples for testing.")

# Convert the dataframes into TensorFlow datasets
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df_train_abalone, label=abalone_label, task=tfdf.keras.Task.REGRESSION)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df_test_abalone, label=abalone_label, task=tfdf.keras.Task.REGRESSION)
2935 examples in training, 1242 examples for testing.

Train a random forest regressor

Now that we have the training and test sets, let's train a random forest regressor. Because of the nature of the data, we need to artificially limit the height of the trees for visualization. (Limiting tree depth is also a form of regularization, used to prevent overfitting.) A depth of 5 is accurate enough while being small enough to visualize.

# Create a random forest regression model
rmodel = tfdf.keras.RandomForestModel(task=tfdf.keras.Task.REGRESSION,  # regression task
                                      max_depth=5,      # limit tree depth to 5 to keep the trees small
                                      random_seed=1234, # fixed seed so the same trees are built each time
                                      verbose=0)        # silence training output

# Train the model on the training dataset
rmodel.fit(x=train_ds)
[INFO 2023-03-07T12:11:19.959239957+00:00 kernel.cc:1214] Loading model from path /tmpfs/tmp/tmpdts8fzxf/model/ with prefix a5115ef6d4b2486a
[INFO 2023-03-07T12:11:19.98628563+00:00 decision_forest.cc:661] Model loaded with 300 root(s), 9264 node(s), and 8 input feature(s).
[INFO 2023-03-07T12:11:19.986325053+00:00 abstract_model.cc:1311] Engine "RandomForestOptPred" built
[INFO 2023-03-07T12:11:19.986350895+00:00 kernel.cc:1046] Use fast generic engine





<keras.callbacks.History at 0x7f68310dd430>

Let's use MAE and MSE to check the accuracy of the model. The range of Rings is 1-27, so an MAE of 1.66 on the test set isn't great, but it's fine for our demonstration purposes.

# Compile the model with mean absolute error (MAE) and mean squared error (MSE) as metrics
rmodel.compile(metrics=["mae","mse"])

# Evaluate the model on the test dataset and return the results as a dict
evaluation = rmodel.evaluate(test_ds, return_dict=True, verbose=0)

# Print the mean squared error (MSE)
print(f"MSE: {evaluation['mse']}")

# Print the mean absolute error (MAE)
print(f"MAE: {evaluation['mae']}")

# Print the root mean squared error (RMSE), the square root of the MSE
print(f"RMSE: {math.sqrt(evaluation['mse'])}")
MSE: 5.4397759437561035
MAE: 1.6559592485427856
RMSE: 2.3323327257825164

Display the decision tree

To use dtreeviz, we need to bundle the model and training data together. We must also choose a specific tree in the random forest to display; let's choose tree 3, just like we did for the classification problem.

# Get the names of the features used by the regression model
abalone_features = [f.name for f in rmodel.make_inspector().features()]

# Create a dtreeviz visualization model for tree index 3, passing the
# training features, training labels, feature names, and the target name 'Rings'
viz_rmodel = dtreeviz.model(rmodel, tree_index=3,
                            X_train=df_train_abalone[abalone_features],
                            y_train=df_train_abalone[abalone_label],
                            feature_names=abalone_features,
                            target_name='Rings')

The view() function shows the structure of the tree, but now the decision nodes are scatter plots rather than stacked bar charts. Each decision node shows a marginal plot of its variable against the target (Rings).

# Display tree 3 of the regression forest at 1.2x scale
viz_rmodel.view(scale=1.2)


Like classification, regression proceeds from the root of the tree toward specific leaves, ultimately making predictions for specific test instances. Nodes on the path to the leaves test numerical or categorical variables, directing the regressor to specific regions of feature space with (hopefully) very similar target values.

The leaves are strip plots that display the Rings target values of all instances in that leaf. The vertical dimension has no meaning; a bit of noise simply separates the points so we can see where the density lies. Consider the lower-left leaf with n=10, Rings=3.30: the mean Rings value of the 10 instances in that leaf is 3.30, which is what the decision tree predicts for any test instance that reaches that leaf.
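As a toy illustration (the values below are made up, chosen only so that they average 3.30), a regression leaf's prediction is just the mean of the target values of its training instances:

# Hypothetical Rings values for a 10-instance leaf; the leaf predicts their mean
leaf_rings = np.array([2, 3, 3, 3, 3, 3, 4, 4, 4, 4])
print(leaf_rings.mean())  # 3.3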

Let's zoom in on the root of the tree to see how the regressor splits on the ShellWeight variable:

# Display only the root node (depth 0) at 2x scale
viz_rmodel.view(depth_range_to_display=[0,0], scale=2)


For a test instance with ShellWeight < 0.164, the regressor proceeds down the root's left child; otherwise it proceeds down the right child. The horizontal dashed lines indicate the mean Rings value for instances with ShellWeight above and below 0.164.
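We can approximate the two dashed lines directly from the training dataframe (a rough check, not part of the original notebook; tree 3 was fit on a bootstrap sample, so the values won't match exactly):

# Mean Rings value on each side of the ShellWeight = 0.164 split
below = df_train_abalone[df_train_abalone['ShellWeight'] < 0.164]['Rings']
above = df_train_abalone[df_train_abalone['ShellWeight'] >= 0.164]['Rings']
print(below.mean(), above.mean())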

Categorical variables, on the other hand, are tested against subsets of categories, because categories are unordered. In the fourth level of the tree, two decision nodes test the categorical variable Type:

# Display only the nodes at depth 3, scaled up by 1.5x
viz_rmodel.view(depth_range_to_display=[3,3], scale=1.5)


Categorical decision nodes use color to indicate the subsets. For example, the decision node on the left of the fourth level directs the regressor down to the left when the test instance has Type=I or Type=F; otherwise it goes down to the right. Yellow and blue represent the two subsets of categorical values associated with the left and right branches. The horizontal dashed lines indicate the mean Rings value of the instances associated with each subset.

To display a large tree, you can use the orientation parameter to get a left-to-right version of the tree, although it is quite tall, so shrinking it with scale is a good idea. Use your computer's screen zoom to magnify regions of interest.

# Display the tree left-to-right ('LR') at 50% scale
viz_rmodel.view(orientation='LR', scale=.5)


We can use non-fancy plots to save space. They still show the split variable and split point at each decision node; they're just not as pretty.

# Visualize the tree without the fancy per-node plots, at 75% scale
viz_rmodel.view(fancy=False, scale=.75)


Check leaf node statistics

When a graph becomes very large, it is sometimes better to focus on the leaf nodes. The leaf_sizes() function reports the number of instances found in each leaf:


# Plot the number of instances in each leaf
viz_rmodel.leaf_sizes(figsize=(5,1.5))

We can also view the distribution of Rings values for the instances in each leaf. Each leaf gets a row on the vertical axis, and the horizontal axis shows the distribution of Rings values for the instances in that leaf. The column on the right shows the mean target value for each leaf.

# Plot the distribution of Rings values in each leaf
viz_rmodel.rtree_leaf_distributions(figsize=(5,5))

Alternatively, we can obtain information about the instance features in a specific node. For example, here is how to get information about the features in leaf node 29, which holds the most instances:

# Show feature statistics for the instances in node 29
viz_rmodel.node_stats(node_id=29)
Diameter Height LongestShell ShellWeight ShuckedWeight Type VisceraWeight WholeWeight
count 672.0 672.000 672.00 672.000 672.0000 672 672.000 672.000
unique 42.0 18.000 48.00 262.000 483.0000 3 363.000 556.000
top 0.5 0.175 0.65 0.335 0.5985 F 0.318 1.262
freq 66.0 115.000 44.00 22.000 5.0000 328 11.000 4.000

How decision trees predict the value of an instance

To make a prediction for a specific instance, the decision tree descends from the root to a specific leaf according to the feature values of the test instance. The prediction of an individual tree is simply the mean of the Rings values of the training instances residing in that leaf. The dtreeviz library can illustrate this process if we provide a test instance via the x parameter.

# Select row 1234 of the abalone dataset
x = df_abalone[abalone_features].iloc[1234]

# Highlight the path taken through the tree for instance x
viz_rmodel.view(x=x, scale=.75)

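As a cross-check (a sketch, not part of the original notebook), we can also run the same instance through the full forest; its prediction averages all of the trees, so it will generally differ from the single tree-3 leaf value shown above:

# Predict Rings for row 1234 with the full random forest
x_ds = tfdf.keras.pd_dataframe_to_tf_dataset(df_abalone[abalone_features].iloc[[1234]],
                                             task=tfdf.keras.Task.REGRESSION)
print(rmodel.predict(x_ds, verbose=0)[0, 0])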

If this visualization is too large, we can reduce the diagram to the actual traversed path from root to leaves.

# Show only the path through the tree for instance x
viz_rmodel.view(x=x, show_just_path=True, scale=1.0)


We can use the horizontal direction to make it smaller:

# Show just the path for instance x, left-to-right, at 75% scale
viz_rmodel.view(x=x, show_just_path=True, scale=.75, orientation="LR")


Sometimes it's easier to get an English description of how the model tests our feature values to make a decision:

# Print an explanation of the prediction path for instance x
print(viz_rmodel.explain_prediction_path(x=x))
0.25 <= Diameter 
ShellWeight < 0.11
Type not in {'M', 'F'}  

Feature space partitioning

Using the rtree_feature_space() function, we can see how the decision tree divides the feature space through a series of splits. For example, here is how the tree splits on ShellWeight:

# Plot the 1D feature space for ShellWeight, showing only the split lines
viz_rmodel.rtree_feature_space(features=['ShellWeight'], show={'splits'})

The horizontal orange bars represent the mean Rings value within each region. Here is another example using the Diameter feature (which has only one split point in this tree):

# Plot the 1D feature space for Diameter, showing only the split lines
viz_rmodel.rtree_feature_space(features=['Diameter'], show={'splits'})

We can also look at a 2D feature space, where the Rings values vary in color from green (low) to blue (high):


# Plot the 2D feature space for ShellWeight and LongestShell, showing the split lines
viz_rmodel.rtree_feature_space(features=['ShellWeight','LongestShell'], show={'splits'})

That heatmap can be confusing because it is really a two-dimensional projection of a three-dimensional space: two features versus the target value. Instead, dtreeviz can show you the three-dimensional plot itself (from a chosen angle and elevation).

# Plot the 3D feature space for ShellWeight and LongestShell
# show={'splits'} draws the split planes; elev, azim and dist control the viewpoint
viz_rmodel.rtree_feature_space3D(features=['ShellWeight','LongestShell'],
                                 show={'splits'}, elev=30, azim=140, dist=11, figsize=(9,8))

If the model tested only the two features ShellWeight and LongestShell, there would be no overlapping vertical "plates": each two-dimensional region of the feature space would make a unique prediction. In this tree, other features are responsible for differentiating the predictions in the regions that appear as ambiguous vertical overlaps.

At this stage, you have learned how to use dtreeviz to display the structure of a decision tree, plot leaf information, trace how the model interprets a specific instance, and see how the model partitions the feature space. You're ready to visualize and interpret trees using your own datasets!

Source: blog.csdn.net/wjjc1017/article/details/135189429