Getting Started with Machine Learning: Building a First Machine Learning Project with the Iris Data

Estimators

Given a scikit-learn estimator object named model, the following methods are available:

Available in all Estimators
- model.fit() : fit the model to training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).
Available in supervised estimators
- model.predict() : given a trained model, predict the label of a new set of data. This method accepts one argument, the new data X_new (e.g. model.predict(X_new)), and returns the predicted label for each object in the array.
- model.predict_proba() : for classification problems, some estimators also provide this method, which returns the probability that a new observation has each categorical label. In this case, the label with the highest probability is returned by model.predict().
- model.score() : for classification or regression problems, most estimators implement a score method. For classifiers the score is the mean accuracy, which lies between 0 and 1. For regressors it is the R^2 coefficient of determination, which is at most 1 and can be negative. In both cases a larger score indicates a better fit.
Available in unsupervised estimators
- model.predict() : predict labels in clustering algorithms.
- model.transform() : given an unsupervised model, transform new data into the new basis. This also accepts one argument X_new, and returns the new representation of the data based on the unsupervised model.
- model.fit_transform() : some estimators implement this method, which more efficiently performs a fit and a transform on the same input data.
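
As a rough illustration of the methods listed above, the following sketch calls each of them once. It is not part of the original project: it assumes scikit-learn's bundled copy of the iris data (load_iris) instead of the UCI file, and the choice of LogisticRegression with max_iter=1000 and of PCA with two components is only for demonstration.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: use the iris copy bundled with scikit-learn (no download needed).
X, y = load_iris(return_X_y=True)
X_train, X_new, y_train, y_new = train_test_split(
    X, y, test_size=0.2, random_state=7
)

# Supervised estimator: fit(X, y), then predict / predict_proba / score.
clf = LogisticRegression(max_iter=1000)  # max_iter raised only to help convergence
clf.fit(X_train, y_train)
labels = clf.predict(X_new)          # one predicted label per row of X_new
proba = clf.predict_proba(X_new)     # one probability per class per row
accuracy = clf.score(X_new, y_new)   # mean accuracy for a classifier

# Unsupervised estimator: fit(X) takes only the data.
pca = PCA(n_components=2)
X_train_2d = pca.fit_transform(X_train)  # fit and transform in one call
X_new_2d = pca.transform(X_new)          # project new data onto the fitted basis

print(labels.shape, proba.shape, X_new_2d.shape)
print("accuracy:", accuracy)
```

Each row of proba sums to 1, and the label returned by predict() is the class with the largest entry in that row.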

The source code for a first machine learning project built on the iris data follows. The comments describe each step and point out things to watch for:

import pandas
from pandas.plotting import scatter_matrix   # scatter-plot matrix (formerly pandas.tools.plotting, which newer pandas releases no longer provide)
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression  # logistic regression (linear model)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis  # linear discriminant analysis
from sklearn.neighbors import KNeighborsClassifier  # k-nearest-neighbors classifier
from sklearn.tree import DecisionTreeClassifier   # decision tree classifier
from sklearn.naive_bayes import GaussianNB  # Gaussian naive Bayes
from sklearn.svm import SVC  # support vector classifier
from sklearn.metrics import accuracy_score  # accuracy score
from sklearn.metrics import confusion_matrix  # confusion matrix, used to assess classification accuracy
from sklearn.metrics import classification_report  # text summary of the main classification metrics

# Load the dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names, na_values=['NA'])    # read the CSV data, assign the column names, and treat the string 'NA' as a missing value

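# Descriptive statistics
# print(dataset.shape)       # dimensions of the data: (150, 5), i.e. 150 samples and 5 columns (four attributes plus the class label)
# print(dataset.head(10))    # look at the data itself by inspecting the first ten rows

# Summary statistics for the four numeric attributes: number of values (count), mean, standard deviation (std), minimum, 25% quantile, median, 75% quantile, maximum
# Quartiles are the three cut points that divide the values, sorted in ascending order, into four equal parts.
# print(dataset.describe())
# print(dataset.groupby('class').size())  # Iris-setosa, Iris-versicolor and Iris-virginica each have 50 samples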

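# Data visualization
# Two kinds of plots are used: univariate plots to understand each attribute, and multivariate plots to understand the relationships between attributes
############################################################################
# Univariate plot: box-and-whisker plot (box plot)
# It summarizes the location and spread of the data and is built on quantiles.
# Sort the data, then compute the upper whisker, upper quartile, median, lower quartile and lower whisker.
# dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)  # one box plot per attribute on a 2x2 grid
# plt.show()   # display the figure
# Univariate plot: histogram
# Two of the histograms look roughly normally distributed, so algorithms that assume normality can exploit this; the x-axis is the measured value and the y-axis is the number of samples in each bin
# dataset.hist()
# plt.show()
############################################################################
# Multivariate plot (plot matrix)
# Scatter plots help reveal structured relationships between variables; points that cluster along a diagonal line indicate that the two variables are highly correlated
# Scatter plots can also be used to spot anomalous data
# scatter_matrix(dataset)
# plt.show()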

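# Evaluate algorithms on the data
############################################################################
# Create a validation dataset
# Split the loaded data into two parts: 80% to train the models and 20% held back as a validation set
array1 = dataset.values    # convert the DataFrame to an array
X = array1[:, 0:4]        # the first four columns are the attributes
Y = array1[:, 4]          # the last column is the class label
validation_size = 0.20    # fraction of the data held back for validation
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=7)   # split the dataset
# X_train, Y_train: training set    X_validation, Y_validation: validation set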
############################################################################
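# Test harness
# Use 10-fold cross-validation to estimate model accuracy: split the training data into 10 parts, train on 9 and validate on 1, and rotate through all 10 combinations. The metric is accuracy, the fraction of correctly classified samples
seed = 7
scoring = 'accuracy'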
############################################################################
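# Build models
# Six algorithms are evaluated, a mix of simple linear (LR and LDA) and nonlinear (KNN, CART, NB, SVM) methods; the random seed is reset before each run so that every algorithm is evaluated on the same splits
# LR: logistic regression   LDA: linear discriminant analysis   KNN: k-nearest neighbors    CART: classification and regression trees    NB: Gaussian naive Bayes    SVM: support vector machine
models = []     # list of (name, estimator) pairs
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
results = []
names = []
for name, model in models:
    # 10-fold cross-validation; shuffle=True is needed for random_state to have any effect (recent scikit-learn versions raise an error otherwise)
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)  # accuracy of this model on each fold
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))  # mean and standard deviation across folds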
############################################################################
# Select the best model
# Plot the evaluation results of the models
fig = plt.figure()   # create the figure (pass figsize=... here to set its size)
fig.suptitle('algorithm comparison')  # figure title
ax = fig.add_subplot(111)   # 111 means a 1x1 grid with the plot drawn in its first (only) cell
ax.boxplot(results)
ax.set_xticklabels(names)
plt.show()

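# Make predictions
# KNN had the highest accuracy in cross-validation, but its accuracy on the validation set still needs to be checked; this independent final check guards against overfitting or data leakage
# Run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)  # fit KNN on the training set
predictions = knn.predict(X_validation)  # predict the validation set
print(accuracy_score(Y_validation, predictions))  # accuracy on the validation set
print(confusion_matrix(Y_validation, predictions))  # confusion matrix
print(classification_report(Y_validation, predictions))  # classification report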
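
The cross-validation and final validation steps can also be tried offline. The sketch below is a condensed variant, not the original script: it assumes scikit-learn's bundled iris data (load_iris), which may differ slightly from the UCI file, so the numbers it prints need not match the script's output. KFold is created with shuffle=True, which recent scikit-learn releases require whenever random_state is set. It also shows how the accuracy score relates to the confusion matrix.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: bundled iris copy; labels are the integers 0, 1, 2.
X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.20, random_state=7
)

# 10-fold cross-validation on the training portion only.
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
cv_scores = cross_val_score(
    KNeighborsClassifier(), X_train, y_train, cv=kfold, scoring="accuracy"
)
print("CV accuracy: %f (%f)" % (cv_scores.mean(), cv_scores.std()))

# Final check on the held-back validation set.
knn = KNeighborsClassifier().fit(X_train, y_train)
predictions = knn.predict(X_val)
acc = accuracy_score(y_val, predictions)
cm = confusion_matrix(y_val, predictions, labels=[0, 1, 2])

# Rows of cm are true classes, columns are predicted classes, so the
# diagonal counts correct predictions and accuracy = trace / total.
print(cm)
print("validation accuracy:", acc, "=", cm.trace(), "/", cm.sum())
```

Because the validation set was never used during cross-validation, a large gap between the two accuracies would be a warning sign of overfitting.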

To be continued...

References:

https://machinelearningmastery.com/machine-learning-in-python-step-by-step/

http://nbviewer.jupyter.org/github/donnemartin/data-science-ipython-notebooks/blob/master/scikit-learn/scikit-learn-intro.ipynb

http://nbviewer.jupyter.org/github/rhiever/Data-Analysis-and-Machine-Learning-Projects/blob/master/example-data-science-notebook/Example%20Machine%20Learning%20Notebook.ipynb
