Python Decision Tree Beginner Example: Titanic Survival Prediction (Decision Tree Visualization)

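 A decision tree is an important and widely used machine learning algorithm. It draws on Shannon's information theory: compute the information entropy of the data, derive the information gain of each candidate split from it,
 and use that gain to decide how the tree's "branches and leaves" are divided.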
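
For reference, the two quantities mentioned above are usually written as follows. This is a sketch in common textbook notation: apart from Ent(D) and |Y|, which appear in the code below, the symbols are assumptions not used in the original post. Here p_k is the proportion of samples in D belonging to class k, a is a candidate attribute with V distinct values, and D^v is the subset of D where a takes its v-th value.

```latex
\mathrm{Ent}(D) = -\sum_{k=1}^{|Y|} p_k \log_2 p_k

\mathrm{Gain}(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)
```
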
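	 Computing the information entropy in Python:
	 from fractions import Fraction  # exact fraction arithmetic
	 from math import log

	 a = Fraction(4, 6)  # positive examples make up 4/6
	 b = Fraction(2, 6)  # negative examples make up 2/6
	 ent_d = -(a * log(a, 2) + b * log(b, 2))  # Ent(D) with |Y| = 2 classes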
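
The two-class calculation above extends naturally to any list of labels, and the same entropy can then be used to score a split. The following is a minimal sketch under stated assumptions: the helper names `entropy` and `information_gain` and the toy `sex` column are hypothetical, made up for illustration, and are not part of the original post or of any library.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(labels, feature_values):
    """Entropy reduction obtained by grouping labels by a feature's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the groups produced by the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


labels = [1, 1, 1, 1, 0, 0]            # 4 positive and 2 negative examples, as above
sex = ["f", "f", "f", "f", "m", "m"]   # made-up feature that separates the classes perfectly

print(round(entropy(labels), 3))                 # 0.918
print(round(information_gain(labels, sex), 3))   # 0.918, the split removes all uncertainty
```

A feature that carries no information about the label would score a gain of 0, so a tree builder using this criterion would prefer the feature with the largest gain at each node.
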
	import numpy as np
	import pandas as pd

	file_path = "E:\\数据集\\练习数据集\\titanic.csv"
	data = pd.read_csv(file_path)
	data["age"].isnull().sum()  # age has 680 missing values

	x = data[["pclass", "age", "sex"]].copy()  # copy so the fill below does not write to a view of data
	y = data["survived"]

	x["age"] = x["age"].fillna(x["age"].mean())  # fill missing ages with the mean age

	from sklearn.model_selection import train_test_split  # train/test splitting
	from sklearn.feature_extraction import DictVectorizer  # feature encoding
	from sklearn.tree import DecisionTreeClassifier  # decision tree classifier

	x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)  # hold out 25% of the rows for testing
	
	info = DictVectorizer(sparse=False)  # one-hot encodes string-valued features
	x_train = info.fit_transform(x_train.to_dict(orient="records"))
	x_test = info.transform(x_test.to_dict(orient="records"))  # reuse the mapping fitted on the training set
	
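	dec = DecisionTreeClassifier(max_depth=5)  # criterion defaults to "gini"; pass criterion="entropy" to split on information gain
	dec.fit(x_train, y_train)
	dec.score(x_test, y_test)  # mean accuracy on the test set
	dec.predict(x_test[0: 1])  # predict the first test sample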
	
	import pydotplus  # visualization
	from sklearn import tree  # provides export_graphviz
	dot_data = tree.export_graphviz(dec, out_file=None,
	                                filled=True, rounded=True,
	                                special_characters=True)
	graph = pydotplus.graph_from_dot_data(dot_data)
	graph.get_nodes()[7].set_fillcolor("#FFF2DD")  # recolor one node
	graph.write_png("graph.png")
	from IPython.display import Image
	Image(graph.create_png())  # display inline in a notebook
The resulting decision tree:

[Image: decision tree visualization]
Dataset: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt
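
To make the DictVectorizer step in the script above more concrete, here is a minimal pure-Python sketch of the same idea: each string-valued feature becomes one 0/1 column per "name=value" pair, numeric features pass through unchanged, and columns are ordered by name. The function `one_hot_records` and the two sample records are hypothetical, made up for illustration; this is a simplified stand-in and not scikit-learn's implementation.

```python
def one_hot_records(records):
    """Encode a list of feature dicts as (column_names, numeric_rows)."""
    # Collect one column per numeric feature and one per "name=value" string pair.
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0   # mark the matching category
            else:
                row[index[key]] = float(value)       # numeric value passes through
        rows.append(row)
    return names, rows


# Made-up sample rows shaped like the pclass / age / sex columns used above.
records = [
    {"pclass": "1st", "age": 29.0, "sex": "female"},
    {"pclass": "3rd", "age": 30.0, "sex": "male"},
]
names, rows = one_hot_records(records)
print(names)    # ['age', 'pclass=1st', 'pclass=3rd', 'sex=female', 'sex=male']
print(rows[0])  # [29.0, 1.0, 0.0, 1.0, 0.0]
```

Because the column list is derived from the records it sees, the mapping learned from the training rows has to be reused for the test rows; rebuilding it from the test rows alone may produce a different set or order of columns.
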

Reposted from blog.csdn.net/qq_42768234/article/details/99453826