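1. Python machine learning library: scikit-learn
1.1 Features:
Simple and efficient tools for data mining and machine learning analysis; open to everyone and highly reusable across different needs; built on NumPy, SciPy, and matplotlib
1.2 Problem areas covered:
Classification; regression; clustering; dimensionality reduction; model selection; preprocessing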
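2. Using scikit-learn
Install scikit-learn: pip, easy_install, or the Windows installer
Install the required packages: numpy, scipy, and matplotlib. Anaconda can be used instead, since it bundles numpy, scipy, and other common scientific computing packages
I use PyCharm 2017 + Anaconda3 (which includes the machine learning library scikit-learn and the commonly used packages; you can check with cmd -> conda list)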
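Besides `conda list`, the installation can also be checked from Python itself. The snippet below is a minimal sketch using only the standard library; the helper name `check_packages` is hypothetical, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Packages the post lists as required (scikit-learn imports as "sklearn")
REQUIRED = ["numpy", "scipy", "matplotlib", "sklearn"]


def check_packages(names):
    """Return {name: True/False} depending on whether each package can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(REQUIRED)
for name, ok in status.items():
    print(f"{name}: {'installed' if ok else 'missing'}")
```

Any package reported as missing can then be installed with pip or conda before running the example.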
3. Example:
RID | age | income | student | credit_rating | class:buys_computer |
1 | youth | high | no | fair | no |
2 | youth | high | no | excellent | no |
3 | middle_aged | high | no | fair | yes |
4 | senior | medium | no | fair | yes |
5 | senior | low | yes | fair | yes |
6 | senior | low | yes | excellent | no |
7 | middle_aged | low | yes | excellent | yes |
8 | youth | medium | no | fair | no |
9 | youth | low | yes | fair | yes |
10 | senior | medium | yes | fair | yes |
11 | youth | medium | yes | excellent | yes |
12 | middle_aged | medium | no | excellent | yes |
13 | middle_aged | high | yes | fair | yes |
14 | youth | medium | no | excellent | no |
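To get a feel for what an entropy-based decision tree looks for in a table like this, the information gain of each attribute can be computed by hand. The sketch below is the textbook (ID3-style) calculation in plain Python, with the rows copied from the table above. It is an illustration only: scikit-learn works on the encoded numeric features and makes binary splits, so its internal computation is not identical. The function names are my own.

```python
import math
from collections import Counter, defaultdict

ATTRIBUTES = ["age", "income", "student", "credit_rating"]

# Rows copied from the table above (RID dropped); the last item is the class label
ROWS = [
    ("youth", "high", "no", "fair", "no"),
    ("youth", "high", "no", "excellent", "no"),
    ("middle_aged", "high", "no", "fair", "yes"),
    ("senior", "medium", "no", "fair", "yes"),
    ("senior", "low", "yes", "fair", "yes"),
    ("senior", "low", "yes", "excellent", "no"),
    ("middle_aged", "low", "yes", "excellent", "yes"),
    ("youth", "medium", "no", "fair", "no"),
    ("youth", "low", "yes", "fair", "yes"),
    ("senior", "medium", "yes", "fair", "yes"),
    ("youth", "medium", "yes", "excellent", "yes"),
    ("middle_aged", "medium", "no", "excellent", "yes"),
    ("middle_aged", "high", "yes", "fair", "yes"),
    ("youth", "medium", "no", "excellent", "no"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attr_index):
    """Entropy of the whole set minus the weighted entropy after splitting on one attribute."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[attr_index]].append(row[-1])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[-1] for row in rows]) - remainder


gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name}: {gain:.3f}")
print("attribute with the highest gain:", best)
```

The attribute with the highest gain is the one an ID3-style tree would split on first.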
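Install Graphviz: http://www.graphviz.org/
Configure the environment variables (add the Graphviz bin directory to PATH): cmd -> env
Convert the dot file to PDF to visualize the decision tree: I installed Graphviz, but I never managed to get the conversion to work from cmd. Any pointers would be appreciated!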
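One thing that may be worth trying: the Graphviz `dot` program accepts `-Tpdf` to choose PDF output and `-o` to name the output file. If cmd says `dot` is not recognized, the Graphviz bin directory is probably not on PATH yet. The sketch below wraps that command in Python; the helper names `build_dot_command` and `render_pdf` are hypothetical, and the file names are only examples.

```python
import shutil
import subprocess


def build_dot_command(dot_file, pdf_file):
    """Build the Graphviz command line: dot -Tpdf <input> -o <output>."""
    return ["dot", "-Tpdf", dot_file, "-o", pdf_file]


def render_pdf(dot_file, pdf_file):
    """Run dot if it can be found on PATH; return True when it was run."""
    if shutil.which("dot") is None:
        print("dot was not found on PATH; add the Graphviz bin directory to PATH first")
        return False
    subprocess.run(build_dot_command(dot_file, pdf_file), check=True)
    return True


# Show the equivalent command to type in cmd
print(" ".join(build_dot_command("DecisionTree.dot", "DecisionTree.pdf")))
```

Calling `render_pdf("DecisionTree.dot", "DecisionTree.pdf")` from the folder that contains the dot file should then produce the PDF, assuming Graphviz is installed correctly.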
Source code:
import csv

from sklearn import preprocessing, tree
from sklearn.feature_extraction import DictVectorizer

# Read the CSV file
csv_path = r'E:\PycharmProjects\python\Decesiopn Tree\Allelectronicsdate.csv'
featureList = []
labelList = []
with open(csv_path, 'r', newline='') as allElectronicsData:
    reader = csv.reader(allElectronicsData)
    headers = next(reader)
    print(headers)
    for row in reader:
        # The last column is the class label
        labelList.append(row[-1])
        # Skip column 0 (RID) and map each remaining header to its value
        rowDict = {}
        for i in range(1, len(row) - 1):
            rowDict[headers[i]] = row[i]
        featureList.append(rowDict)
print(featureList)

# Convert the data into the numeric format sklearn requires
vec = DictVectorizer()
dummyX = vec.fit_transform(featureList).toarray()
print("dummyX:" + str(dummyX))
# get_feature_names() was removed in newer scikit-learn releases
featureNames = list(vec.get_feature_names_out())
print(featureNames)
print("labelList:" + str(labelList))

lb = preprocessing.LabelBinarizer()
dummyY = lb.fit_transform(labelList)
print("dummyY:" + str(dummyY))

# Fit the decision tree, using entropy as the split criterion
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf = clf.fit(dummyX, dummyY)
print("clf:" + str(clf))

# Export the decision tree to a dot file
dot_path = r'E:\PycharmProjects\python\Decesiopn Tree\DecisionTree.dot'
with open(dot_path, 'w') as f:
    tree.export_graphviz(clf, feature_names=featureNames, out_file=f)

oneRowX = dummyX[0, :]
print("oneRowX:" + str(oneRowX))

# Predict for a new row (copy first so that dummyX is not modified)
newRowX = oneRowX.copy()
newRowX[0] = 1
newRowX[2] = 0
print("newRowX:" + str(newRowX))

# predict() expects a 2D array of shape (n_samples, n_features)
predictedY = clf.predict(newRowX.reshape(1, -1))
print("predictedY:" + str(predictedY))
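It may not be obvious why setting `newRowX[0] = 1` and `newRowX[2] = 0` produces a meaningful new sample. The sketch below mimics in plain Python what `DictVectorizer` does with string features: each `name=value` pair becomes one 0/1 column, and by default the columns are sorted by name. The function `one_hot_encode` and the three sample records are my own illustration, not part of scikit-learn.

```python
def one_hot_encode(records):
    """Turn a list of {feature: value} dicts into sorted column names and 0/1 rows."""
    feature_names = sorted({f"{k}={v}" for rec in records for k, v in rec.items()})
    index = {name: i for i, name in enumerate(feature_names)}
    matrix = []
    for rec in records:
        row = [0] * len(feature_names)
        for k, v in rec.items():
            row[index[f"{k}={v}"]] = 1
        matrix.append(row)
    return feature_names, matrix


# Three illustrative records that together cover every value in the table
records = [
    {"age": "youth", "income": "high", "student": "no", "credit_rating": "fair"},
    {"age": "middle_aged", "income": "low", "student": "yes", "credit_rating": "excellent"},
    {"age": "senior", "income": "medium", "student": "no", "credit_rating": "fair"},
]

names, matrix = one_hot_encode(records)
for i, name in enumerate(names):
    print(i, name)
print(matrix[0])

# Same edit as in the post: switch column 0 on and column 2 off
new_row = matrix[0].copy()
new_row[0] = 1
new_row[2] = 0
print(new_row)
```

Column 0 is `age=middle_aged` and column 2 is `age=youth`, so the edit turns the first row (a youth) into an otherwise identical middle_aged customer.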
The generated decision tree: