Introduction to Python Machine Learning (1)

Introduction to Machine Learning

If you are a machine learning beginner like me, I will take you through a simple project to get you started with machine learning. Let's start!

1. Project Introduction

This project classifies iris flowers. The data set contains measurements and class labels for three species of iris. We use machine learning to train a model on it so that flowers can be classified automatically. This is a multi-class classification problem solved with supervised learning.
The steps are as follows:
(1) Import the data
(2) Get an overview of the data
(3) Visualize the data
(4) Evaluate algorithms
(5) Make predictions

2. Import data

2.1 Import library

The code is as follows:

# Import libraries
from pandas import read_csv
from pandas.plotting import scatter_matrix
from matplotlib import pyplot
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

If any of these imports fails, check whether the corresponding library (pandas, matplotlib, or scikit-learn) is installed.
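One way to check for missing libraries before running the imports is sketched below. It uses only the standard library. The helper name `missing_modules` is made up for this illustration. Note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec

def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    return [name for name in module_names if find_spec(name) is None]

# Top-level import names used in this article
# (scikit-learn is imported as "sklearn").
required = ["pandas", "matplotlib", "sklearn"]

missing = missing_modules(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required libraries are installed.")
```

Anything reported as not installed can then be installed with your usual package manager before continuing.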

2.2 Import data

We can download the iris data set from the UCI Machine Learning Repository, or find it with a web search (for example on Baidu). After downloading it and saving it to the working directory, we use Pandas to import the CSV data and analyze it statistically, and Matplotlib to visualize it. When importing the data, we give each column a name, which helps when displaying the data later. The code is as follows:

# Import the data
filename = 'iris.data.csv'
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(filename, names=names)
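The data file has no header row, which is why the column names are supplied by hand. As a library-free sketch of the same idea, the standard `csv` module can attach names to header-less rows. The three rows below are a small sample in the file's layout, not the whole data set.

```python
import csv
import io

# A few sample rows in the same header-less layout as the iris data file.
sample = io.StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']

# fieldnames plays the same role here as names= does in read_csv:
# the file has no header row, so we supply the column names ourselves.
rows = list(csv.DictReader(sample, fieldnames=names))

for row in rows:
    print(row['class'], row['petal-length'])
```

Unlike read_csv, the csv module keeps every value as a string, so numeric columns would still need converting with float().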

2.3 Data overview

We do four things:
(1) View the dimensions of the data
(2) View the data itself
(3) Summarize all features statistically
(4) View the class distribution
The relevant code is as follows:

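# Show the dimensions of the data
print('Data dimensions: %s rows, %s columns' % dataset.shape)

# View the first 10 rows of the data
print(dataset.head(10))

# Descriptive statistics of the data
print(dataset.describe())

# Class distribution
print(dataset.groupby('class').size())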
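To make clearer what these calls report, here is a rough pure-Python sketch of the same ideas on a tiny, hypothetical data set (made-up rows, not the real file). It only approximates what the pandas methods do.

```python
from collections import Counter
from statistics import mean

# Hypothetical miniature data set: (petal-length, class) pairs.
records = [
    (1.4, 'Iris-setosa'),
    (1.3, 'Iris-setosa'),
    (4.7, 'Iris-versicolor'),
    (4.5, 'Iris-versicolor'),
    (6.0, 'Iris-virginica'),
]

# Rough equivalent of dataset.shape: (rows, columns).
shape = (len(records), len(records[0]))

# Rough equivalent of dataset.groupby('class').size().
class_counts = Counter(label for _, label in records)

# A small part of what dataset.describe() reports for a numeric column.
lengths = [length for length, _ in records]
summary = {'count': len(lengths), 'mean': mean(lengths),
           'min': min(lengths), 'max': max(lengths)}

print(shape)
print(dict(class_counts))
print(summary)
```

The class counts matter for the later steps: if the classes were badly unbalanced, accuracy alone would be a misleading score.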

2.4 Data visualization

This is my favorite part. It is exciting to see so much dry data turned into orderly pictures. The code is as follows:

# Box plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()

# Histograms
dataset.hist()
pyplot.show()

# Scatter-plot matrix
scatter_matrix(dataset)
pyplot.show()

Running this code displays three figures in turn: box plots, histograms, and a scatter-plot matrix of the four features.
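A box plot is drawn from a handful of summary numbers for each feature. The sketch below computes them with the standard library, using made-up values and a hypothetical helper name, `five_number_summary`. The rules for whiskers and outliers are left out.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), the values a box plot is built on."""
    q1, median, q3 = quantiles(values, n=4, method='inclusive')
    return min(values), q1, median, q3, max(values)

# Hypothetical measurements, not taken from the iris data set.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(values))
```

The box spans Q1 to Q3 and the line inside it marks the median.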


2.5 Evaluation algorithm

Build models with several different algorithms and evaluate their accuracy to find the most suitable one. The steps are as follows:
(1) Set aside a validation data set
(2) Evaluate the models with 10-fold cross-validation
(3) Build 6 different models to predict new data
(4) Select the best model
The code is as follows:

# Split off a validation data set
array = dataset.values
X = array[:, 0:4]
Y = array[:, 4]
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = \
    train_test_split(X, Y, test_size=validation_size, random_state=seed)

# Algorithms to review
models = {}
models['LR'] = LogisticRegression()
models['LDA'] = LinearDiscriminantAnalysis()
models['KNN'] = KNeighborsClassifier()
models['CART'] = DecisionTreeClassifier()
models['NB'] = GaussianNB()
models['SVM'] = SVC()
# Evaluate the algorithms
results = []
for key in models:
    # Recent scikit-learn versions reject random_state unless shuffle=True
    kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = cross_val_score(models[key], X_train, Y_train, cv=kfold, scoring='accuracy')
    results.append(cv_results)
    print('%s: %f (%f)' %(key, cv_results.mean(), cv_results.std()))

# Compare the algorithms with box plots
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(models.keys())
pyplot.show()
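The 10-fold cross-validation used above splits the training data into 10 parts. Each part in turn serves as the test fold while the other 9 are used for training. The sketch below shows the index bookkeeping in plain Python, without shuffling. The function name `kfold_indices` is hypothetical, and it is an illustration of the idea, not scikit-learn's implementation. The training-set size of 120 assumes the usual 150-row iris file with 20% held out.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, test
        start = stop

# 120 is the assumed training-set size: 80% of the 150 iris rows.
folds = list(kfold_indices(n_samples=120, n_splits=10))
for number, (train, test) in enumerate(folds, start=1):
    print('fold %d: train on %d samples, test on %d' % (number, len(train), len(test)))
```

Every sample is used for testing exactly once, which is why the mean of the 10 scores is a steadier estimate than a single train/test split.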

2.6 Making predictions

The evaluation results show that SVM is the most accurate algorithm. Now we use the reserved validation data set to verify this model, which gives an accurate and intuitive picture of how well it performs.
We train the support vector machine on the whole training set, then report its performance on the reserved validation data set. The code is as follows:

# Evaluate the algorithm on the validation data set
svm = SVC()
svm.fit(X=X_train, y=Y_train)
predictions = svm.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))

After running the program, the accuracy is 0.8666666666666667. The program also prints the confusion matrix and the classification report.
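To make these metrics concrete, here is a hand-rolled sketch of accuracy and a confusion matrix on made-up labels. The helper names `accuracy` and `confusion` are hypothetical, and the labels are not real model output. The matrix follows the same convention as scikit-learn's confusion_matrix: rows are true classes and columns are predicted classes.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def confusion(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical labels, made up for illustration only.
labels = ['setosa', 'versicolor', 'virginica']
y_true = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'versicolor']

print(accuracy(y_true, y_pred))
for row in confusion(y_true, y_pred, labels):
    print(row)

# With 30 validation samples (20% of 150 rows), 26 correct predictions
# give the accuracy quoted above.
print(26 / 30)
```

Off-diagonal entries in the matrix show which classes get confused with each other, which a single accuracy number cannot show.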

Origin www.cnblogs.com/hhwblogs/p/12716165.html