This is the third assignment from the data science course I took last year, which was taught by Xiao Ruoxiu at the time. I have heard that from this year on, the computer science and Internet of Things majors will be taught at the same level of difficulty, so this article may serve as nothing more than a record for junior students. When I took the course, Mr. Xiao did not take attendance. After the four assignments, I ended up with a fairly good score, even without taking the elective courses abroad.
Previous post:
Data Science Assignment 2: House Transaction Price Prediction
1. Assignment description
This assignment provides the iris data set, which contains 150 records; the fields were explained in the course. The goal is to predict the iris category accurately from four features: petal width, petal length, sepal width, and sepal length. It mainly tests students' understanding and application of classification algorithms.
Specific requirements:
(1) Choose a reasonable strategy for decomposing the three-class problem, implement two classifiers chosen from logistic regression, k-NN, SVM, and decision tree, determine the hyperparameters in a reasoned way, and select suitable evaluation metrics to analyze classifier performance.
(2) Implement an ensemble classifier, and select suitable evaluation metrics to analyze its performance.
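The two requirements can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the solution submitted for the course: it loads scikit-learn's bundled copy of the iris data instead of the course's iris.csv, uses one-vs-rest as the decomposition strategy, and uses a hard-voting `VotingClassifier` as the ensemble. The choice of base models, `max_iter=1000`, and the stratified split are my own assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Assumption: scikit-learn's bundled iris data stands in for the course's iris.csv.
X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# One-vs-rest decomposition: one binary logistic regression per class;
# the class whose binary model gives the highest score is predicted.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(train_X, train_y)
ovr_acc = ovr.score(test_X, test_y)
print("Number of binary models:", len(ovr.estimators_))
print("One-vs-rest accuracy:", ovr_acc)

# Hard-voting ensemble: each base classifier casts one vote per sample
# and the majority class wins.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("knn", KNeighborsClassifier(n_neighbors=3)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(train_X, train_y)
ensemble_acc = ensemble.score(test_X, test_y)
print("Voting ensemble accuracy:", ensemble_acc)
```

One-vs-one (`sklearn.multiclass.OneVsOneClassifier`) is the other common decomposition; with only three classes both strategies train three binary models, so the difference is mostly in how the votes are combined.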
2. Procedure
1. Import related libraries
import numpy as np
import pandas as pd
from pandas import plotting
import matplotlib.pyplot as plt
# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8' (assumes a recent Matplotlib)
plt.style.use('seaborn-v0_8')
import seaborn as sns
sns.set_style("whitegrid")
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
2. Read data
iris = pd.read_csv('iris.csv')
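3. Draw violin plots
# Color theme shared by the plots below
antV = ['#1890FF', '#2FC25B', '#FACC14', '#223273', '#8543E0', '#13C2C2', '#3436c7', '#F04864']
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)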
sns.despine(left=True)
sns.violinplot(x='targetname', y='sepal length (cm)', data=iris, palette=antV, ax=axes[0, 0])
sns.violinplot(x='targetname', y='sepal width (cm)', data=iris, palette=antV, ax=axes[0, 1])
sns.violinplot(x='targetname', y='petal length (cm)', data=iris, palette=antV, ax=axes[1, 0])
sns.violinplot(x='targetname', y='petal width (cm)', data=iris, palette=antV, ax=axes[1, 1])
plt.show()
4. Draw pointplot
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.pointplot(x='targetname', y='sepal length (cm)', data=iris, color=antV[0], ax=axes[0, 0])
sns.pointplot(x='targetname', y='sepal width (cm)', data=iris, color=antV[0], ax=axes[0, 1])
sns.pointplot(x='targetname', y='petal length (cm)', data=iris, color=antV[0], ax=axes[1, 0])
sns.pointplot(x='targetname', y='petal width (cm)', data=iris, color=antV[0], ax=axes[1, 1])
plt.show()
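5. Draw Andrews curves
Andrews curves turn each multivariate observation into a curve by using its feature values as the coefficients of a Fourier series. Observations from the same class tend to produce similar curves, which helps reveal cluster structure and outliers in multivariate data.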
plt.subplots(figsize = (10,8))
plotting.andrews_curves(iris, 'targetname', colormap='cool')
plt.show()
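To show what the Andrews curve plot in step 5 computes, here is a minimal sketch that evaluates the curve for a single observation by hand. The helper name `andrews_curve` is hypothetical and the sample values are only illustrative; the formula f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... is the standard definition.

```python
import numpy as np

def andrews_curve(x, t):
    """Evaluate f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ..."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(t, dtype=float)
    # Constant term from the first feature.
    result = np.full_like(t, x[0] / np.sqrt(2.0))
    # Remaining features alternate between sin and cos, raising the
    # harmonic by one after every sin/cos pair.
    for i, coeff in enumerate(x[1:]):
        harmonic = i // 2 + 1
        basis = np.sin if i % 2 == 0 else np.cos
        result = result + coeff * basis(harmonic * t)
    return result

# Illustrative observation: sepal length, sepal width, petal length, petal width.
sample = [5.1, 3.5, 1.4, 0.2]
t = np.linspace(-np.pi, np.pi, 5)
curve = andrews_curve(sample, t)
print(curve)
```

At t = 0 every sine term vanishes, so the curve value reduces to x1/sqrt(2) + x3, which is a quick sanity check on the implementation.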
6. Visualization of Linear Regression
g = sns.lmplot(data=iris, x='sepal width (cm)', y='sepal length (cm)', palette=antV, hue='targetname')
g = sns.lmplot(data=iris, x='petal width (cm)', y='petal length (cm)', palette=antV, hue='targetname')
7. Find the correlation between different features in the data set through the heat map
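# numeric_only=True skips the text column 'targetname'; pandas 2.0+ raises an error without it
fig=sns.heatmap(iris.corr(numeric_only=True), annot=True, cmap='GnBu', linewidths=1, linecolor='k',
square=True, mask=False, vmin=-1, vmax=1, cbar_kws={"orientation": "vertical"}, cbar=True)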
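To make the numbers in the heat map concrete, the sketch below computes one cell of the correlation matrix directly from the definition of the Pearson coefficient and compares it with `DataFrame.corr`. It assumes scikit-learn's bundled iris data in place of iris.csv, and the variable names are my own.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data stands in for the course's iris.csv.
iris_df = load_iris(as_frame=True).frame.drop(columns="target")
corr = iris_df.corr()

# Pearson r from its definition: covariance divided by the product of
# the standard deviations (the 1/n factors cancel out).
a = iris_df["petal length (cm)"].to_numpy()
b = iris_df["petal width (cm)"].to_numpy()
da, db = a - a.mean(), b - b.mean()
r_manual = np.sum(da * db) / np.sqrt(np.sum(da ** 2) * np.sum(db ** 2))

print("manual r:", r_manual)
print("pandas r:", corr.loc["petal length (cm)", "petal width (cm)"])
```

A value close to 1 for petal length and petal width means the two features carry largely overlapping information, which is worth keeping in mind when interpreting the classifiers.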
8. Machine Learning
X = iris[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = iris['targetname']
encoder = LabelEncoder()
y = encoder.fit_transform(y)
#print(y)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 101)
#print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# Support Vector Machine
model = svm.SVC()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
# accuracy_score expects (y_true, y_pred); accuracy is the same in either order
print('The accuracy of the SVM is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
# Logistic Regression
model = LogisticRegression()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Logistic Regression is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
# Decision Tree
model = DecisionTreeClassifier()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Decision Tree is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
# K-Nearest Neighbours
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the KNN is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
The accuracy of the four methods:
The accuracy of the SVM is: 0.9777777777777777
The accuracy of the Logistic Regression is: 0.9777777777777777
The accuracy of the Decision Tree is: 0.9555555555555556
The accuracy of the KNN is: 1.0
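A single accuracy figure on 45 test samples says little about which classes get confused, and the value n_neighbors=3 was fixed by hand. The sketch below shows one possible way to choose k by cross-validated grid search and then report a confusion matrix and per-class precision and recall. It is an illustration under assumptions: the data comes from scikit-learn's bundled iris set rather than iris.csv, and the search range 1 to 15, the 5 folds, and the stratified split are my own choices.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Assumption: scikit-learn's bundled iris data stands in for the course's iris.csv.
X, y = load_iris(return_X_y=True)
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.3, random_state=101, stratify=y
)

# Pick k by 5-fold cross-validation on the training set only,
# so the test set stays untouched until the final evaluation.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 16))},
    cv=5,
    scoring="accuracy",
)
search.fit(train_X, train_y)
best_k = search.best_params_["n_neighbors"]
print("Best k:", best_k)

# GridSearchCV refits the best model on the whole training set by default.
prediction = search.predict(test_X)
cm = confusion_matrix(test_y, prediction)
print(cm)  # rows: true class, columns: predicted class
print(classification_report(test_y, prediction))
```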
3. Visualization results
4. Source code
import numpy as np
import pandas as pd
from pandas import plotting
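# %matplotlib inline  (uncomment when running in Jupyter)
import matplotlib.pyplot as plt
# Matplotlib 3.6 renamed the 'seaborn' style to 'seaborn-v0_8' (assumes a recent Matplotlib)
plt.style.use('seaborn-v0_8')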
import seaborn as sns
sns.set_style("whitegrid")
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm
from sklearn import metrics
from sklearn.tree import DecisionTreeClassifier
iris = pd.read_csv('iris.csv')
#iris.info()
# Set the color theme
antV = ['#1890FF', '#2FC25B', '#FACC14', '#223273', '#8543E0', '#13C2C2', '#3436c7', '#F04864']
# Draw violin plots
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.violinplot(x='targetname', y='sepal length (cm)', data=iris, palette=antV, ax=axes[0, 0])
sns.violinplot(x='targetname', y='sepal width (cm)', data=iris, palette=antV, ax=axes[0, 1])
sns.violinplot(x='targetname', y='petal length (cm)', data=iris, palette=antV, ax=axes[1, 0])
sns.violinplot(x='targetname', y='petal width (cm)', data=iris, palette=antV, ax=axes[1, 1])
plt.show()
f, axes = plt.subplots(2, 2, figsize=(8, 8), sharex=True)
sns.despine(left=True)
sns.pointplot(x='targetname', y='sepal length (cm)', data=iris, color=antV[0], ax=axes[0, 0])
sns.pointplot(x='targetname', y='sepal width (cm)', data=iris, color=antV[0], ax=axes[0, 1])
sns.pointplot(x='targetname', y='petal length (cm)', data=iris, color=antV[0], ax=axes[1, 0])
sns.pointplot(x='targetname', y='petal width (cm)', data=iris, color=antV[0], ax=axes[1, 1])
plt.show()
#g = sns.pairplot(data=iris, palette=antV, hue= 'targetname')
plt.subplots(figsize = (10,8))
plotting.andrews_curves(iris, 'targetname', colormap='cool')
plt.show()
g = sns.lmplot(data=iris, x='sepal width (cm)', y='sepal length (cm)', palette=antV, hue='targetname')
g = sns.lmplot(data=iris, x='petal width (cm)', y='petal length (cm)', palette=antV, hue='targetname')
fig=plt.gcf()
fig.set_size_inches(12, 8)
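# numeric_only=True skips the text column 'targetname'; pandas 2.0+ raises an error without it
fig=sns.heatmap(iris.corr(numeric_only=True), annot=True, cmap='GnBu', linewidths=1, linecolor='k',
square=True, mask=False, vmin=-1, vmax=1, cbar_kws={"orientation": "vertical"}, cbar=True)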
X = iris[['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']]
y = iris['targetname']
encoder = LabelEncoder()
y = encoder.fit_transform(y)
#print(y)
train_X, test_X, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 101)
#print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)
# Support Vector Machine
model = svm.SVC()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
# accuracy_score expects (y_true, y_pred); accuracy is the same in either order
print('The accuracy of the SVM is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
# Logistic Regression
model = LogisticRegression()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Logistic Regression is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
# Decision Tree
model = DecisionTreeClassifier()
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the Decision Tree is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
# K-Nearest Neighbours
model = KNeighborsClassifier(n_neighbors=3)
model.fit(train_X, train_y)
prediction = model.predict(test_X)
print('The accuracy of the KNN is: {0}'.format(metrics.accuracy_score(test_y, prediction)))
5. Reflections
Working through the iris case gave me a preliminary understanding of machine learning and let me feel the charm of this subject.