Before we start
- Operating system: Windows 10
- Python version: 3.6.3
- Kettle version: 8.3
- Dataset: Soda
- Decision tree algorithm: [Data Mining] Notes 2: Decision Trees
Data processing
- Original data format
- Target format
- Kettle transformation
[Data Mining] Kettle: removing empty records & adding flags
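Judging from the reading code later in this post, each line of the target file holds four semicolon-separated fields: a weather value, two numeric values, and an interval label. As a rough sketch of what "no empty records" means for that layout, the hypothetical helper `parse_record` below (not part of the Kettle job, and the sample lines are made up) rejects any line with a missing or non-numeric field.

```python
def parse_record(line):
    """Parse one 'weather;value1;value2;label' line.

    Returns (features, label), or None if the record is incomplete.
    The field layout is an assumption inferred from the reading code in this post.
    """
    tokens = [t.strip() for t in line.strip().split(";")]
    if len(tokens) != 4 or any(t == "" for t in tokens):
        return None
    try:
        # '雨' (rain) is encoded as 1, anything else as 0
        features = [1 if tokens[0] == "雨" else 0, float(tokens[1]), float(tokens[2])]
    except ValueError:
        return None
    return features, tokens[3]


# Sample lines invented for illustration; they are not rows from the Soda dataset.
sample = ["雨;7.2;310.5;[10,60)", "晴;;120.0;[0,10)", ""]
parsed = [parse_record(s) for s in sample]
print(parsed)
```

Lines that come back as `None` are the ones the Kettle transformation is supposed to have filtered out already, so this is mostly useful as a sanity check on the exported file.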
Python code
import numpy as np
from sklearn import tree
# Basic utilities for working with the decision tree
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Read the data
data = []
labels = []
with open("file.txt", encoding="utf-8") as ifile:
    for line in ifile:
        tokens = line.strip().split(';')
        # Encode the first field as a flag: 1 if it is '雨' (rain), otherwise 0
        bol = 0
        if tokens[0] == '雨':
            bol = 1
        data_elem = [bol, float(tokens[1]), float(tokens[2])]
        # print(data_elem)
        data.append(data_elem)
        labels.append(tokens[3].rstrip())

x = np.array(data)
labels = np.array(labels)
y = np.zeros(labels.shape)

# Convert the interval labels to floating-point class ids
y[labels == '[0,10)'] = 0
y[labels == '[10,60)'] = 1
y[labels == '[60,-)'] = 2

# Split into training data and test data
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.4)

# Core code: train the decision tree using information entropy as the split criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(x_train, y_train)

# Evaluate on the test set
answer = clf.predict(x_test)
print(classification_report(y_test, answer))
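To make the entropy criterion used above a bit more concrete, here is a minimal sketch of how entropy and information gain are computed for one candidate split. The helpers `entropy` and `information_gain` are my own illustrative names, not scikit-learn functions, and the labels are toy values rather than data from this post.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    total = len(parent)
    weighted = (len(left) / total) * entropy(left) + (len(right) / total) * entropy(right)
    return entropy(parent) - weighted


# Toy labels invented for illustration only.
parent = [0, 0, 1, 1]
pure_gain = information_gain(parent, [0, 0], [1, 1])   # split separates the classes
mixed_gain = information_gain(parent, [0, 1], [0, 1])  # split separates nothing
print(pure_gain, mixed_gain)
```

A tree trained with `criterion='entropy'` prefers the split with the larger gain at each node, so in this toy case the first split would win.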
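Results
- Test set results
The accuracy is not very high. Maybe the algorithm is a poor fit, or the features are only weakly related to the label. I don't have a fix for this.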
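One cheap way to judge whether the accuracy is actually poor is to compare it with a majority-class baseline, that is, always predicting the most frequent training label. The sketch below is only an illustration: `majority_baseline_accuracy` is a hypothetical helper and the labels are toy values, not results from this post.

```python
from collections import Counter


def majority_baseline_accuracy(y_train, y_test):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_test if label == majority_label)
    return hits / len(y_test)


# Toy labels invented for illustration; with the script above you would
# pass its y_train and y_test instead.
toy_train = [0, 0, 0, 1, 2]
toy_test = [0, 1, 0, 2]
baseline = majority_baseline_accuracy(toy_train, toy_test)
print(baseline)
```

If the tree barely beats this number, that would point toward the features carrying little information about the label rather than toward a problem with the algorithm itself.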
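- Visualization
Visualization requires pip install graphviz.
You also need to install the Graphviz program itself, so that its bin directory can be added to PATH.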
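import os

import graphviz

# Visualize the decision tree; place this at the end of the script
os.environ['PATH'] += os.pathsep + 'D:/Program Files (x86)/Graphviz/bin'
# feature_names needs one entry per input column; "rain" is a name chosen here
# for the 0/1 flag built from the first field
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=["rain", "PH", "elect"],
                                class_names=["[0,10)", "[10,60)", "[60,-)"],
                                filled=True, rounded=True)
graph = graphviz.Source(dot_data)
graph.format = 'png'
graph.render("water", view=True)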
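If installing Graphviz is a hassle, scikit-learn (0.21 or later) can also print the learned rules as plain text with `sklearn.tree.export_text`. The sketch below trains a tiny tree on made-up numbers just to show the call; the feature names and values are assumptions for illustration, not the data from this post.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up samples: [rain flag, PH, elect]; only "elect" separates the two classes
X = [
    [0, 6.5, 100.0],
    [1, 7.0, 300.0],
    [1, 8.0, 120.0],
    [0, 7.5, 310.0],
]
y = [0, 1, 0, 1]

toy_clf = DecisionTreeClassifier(criterion="entropy", random_state=0)
toy_clf.fit(X, y)

# Text rendering of the tree; no Graphviz installation is needed for this
rules = export_text(toy_clf, feature_names=["rain", "PH", "elect"])
print(rules)
```

The same call should work on the `clf` trained earlier, as long as `feature_names` has one entry per input column.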