1. Introduction to machine learning

1. Development history of machine learning

Simply use a mind map to sort out the development process of machine learning:
Insert picture description here

In the past two decades, the ability of humans to collect , store , transmit , and process data has been rapidly improved. Every corner of human society has accumulated a large amount of data. There is an urgent need for machine algorithms that can effectively analyze and use data. Learning conforms to the needs of the great age, so this subject area has naturally made great developments and received widespread attention.

2. Machine learning task classification

Machine learning can be roughly divided into the following categories:

Insert picture description here

Focus on explaining the meaning of "supervision". Taking supervised learning as an example, the task of supervised learning is to try to learn a mathematical model so that the model can make a good prediction of its corresponding output for any given input. Supervision means that the output exists and is known (in the case of classification, the type is known, in the case of regression, the range is known). The purpose of semi-supervised clustering learning and unsupervised learning is to explore the internal relationships and structure in the data.

2. Demonstration of machine learning example code

1. Preparation

Use anaconda to build a running environment
Reference tutorial:
Win10 uses Anaconda to build a Pandas+Numpy environment
Install related packages (here can also be written in requirements.txt for unified installation)

pip install numpy
pip install pandas
pip install sklearn
pip install matplotlib
pip install seaborn

Insert picture description here

Introduce related packages into the code header

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline 
plt.style.use("ggplot")      
import seaborn as sns
from sklearn import datasets

2. Return

1) Overview of regression analysis

In statistics, regression analysis refers to a statistical analysis method that determines the quantitative relationship between two or more variables.

According to the type of relationship between the independent variable and the dependent variable, it can be divided into linear regression analysis and nonlinear regression analysis.

Among them, linear regression analysis can be divided into unary linear regression and multiple linear regression.

2) Return case

The Boston Housing Price Data Set contains information about housing prices in Boston, Massachusetts, US collected by the US Census Bureau. The data set is very small, with only 506 cases.

variable name	Description
CRIM	Crime rate of urban population
ZN	Percentage of residential land with more than 25,000 square feet
INDUS	Proportion of urban non-retail business areas
CHAS	Charles River dummy variable (if the land is by the river=1; otherwise it is 0)
NOX	Nitric oxide concentration (per 10 million shares)
RM	Average number of rooms per resident
AGE	Percentage of owner-occupied units built before 1940
DIS	Weighted distance to five Boston job centers
RAD	Accessibility index of radial highway
TAX	Full property tax rate per $10,000
RTRATIO	Urban teacher-student ratio
B	1000(Bk-0.63)^2 where Bk is the proportion of urban blacks
LSTAT	Percentage of lower-status people in the population
WITH V	(Target Variable/Category Attribute) The median of self-owned housing calculated at $1,000

View the data set:

v_housing = datasets.load_boston() 
X = v_housing.data
y = v_housing.target
features = v_housing.feature_names
boston_data = pd.DataFrame(X,columns=features)
boston_data["Price"] = y
boston_data.head()

Insert picture description here
View feature types:

print(v_housing.feature_names)
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']

Visualize thirteen variables and prices in a scatter plot:

v_namedata = v_housing.feature_names
for i in range(len(v_housing.feature_names)):
    sns.scatterplot(x=X[:,i],y=y,color="r",alpha=0.6)
    plt.title(v_namedata[i])
    plt.show()

Insert picture description here

It can be seen that the average number of rooms (RM) and the price (Prine) of a house are closer to a positive linear correlation. Let us find the relevant coefficient weight w and offset b:

from sklearn.linear_model import LinearRegression
plt.rcParams['font.sans-serif'] = ['SimHei']
x = v_housing.data[:,5,np.newaxis]
y = v_housing.target
lm = LinearRegression()
lm.fit(x,y)
print('方程的确定性系数R的平方：',lm.score(x,y))
print('线性回归算法的权重w：',lm.coef_)
print('线性回归算法的偏移量b：',lm.intercept_)
plt.scatter(x,y,color='blue')
plt.plot(x,lm.predict(x),color='green',linewidth=6)
plt.xlabel('住宅平均房间数(RM)')
plt.ylabel(u'房屋价格')
plt.title('线性回归住宅平均房间数RM与房屋价格Price的关系')
plt.show()
方程的确定性系数R的平方： 0.48352545599133423
线性回归算法的权重w： [9.10210898]
线性回归算法的偏移量b： -34.670620776438554

Insert picture description here

3. Classification case

Kite (yuān) tail flower

Iris Data Set (Iris Plant Data Set) is a relatively long-standing data set. It first appeared in the 1936 paper "The use of multiple measurements in taxonomic problems" by the famous British statistician and biologist Ronald Fisher. It is used to introduce linear discriminant analysis.

In this data set, three different types of Iris plants are included: Iris Setosa, Iris Versicolour, and Iris Virginica. Each category collects 50 samples, so this data set contains a total of 150 samples, and each sample selects 4 representative features. They are:

sepallength: the length of the sepal, in centimeters
sepalwidth: the width of the sepal, in centimeters
petallength: petal length, in centimeters
petalwidth: petal width, in centimeters

iris = datasets.load_iris()
X = iris.data
y = iris.target
features = iris.feature_names
iris_data = pd.DataFrame(X,columns=features)
iris_data['target'] = y
iris_data.head()

Insert picture description here

#自定义marker
marker = ['s','x','o']
#依次遍历三种不同的花
for index,c in enumerate(np.unique(y)):
    plt.scatter(x=iris_data.loc[y==c,"sepal length (cm)"],y=iris_data.loc[y==c,"sepal width (cm)"],alpha=0.8,label=c,marker=marker[c])
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend()
plt.show()

Insert picture description here

Explanation of plt.scatter parameters:

x,y: two-dimensional data of scatter plot
alpha: Transparency, the range is [0,1], from transparent to opaque
label: represents the name of the point
marker: MarkerStyle, default'o', the specific meaning is as follows:

Now explore why there are purple squares, take (6.0,2.2) as an example:

Insert picture description here

Dismantling analysis:

index = c = 1
plt.scatter(x=iris_data.loc[y==c,"sepal length (cm)"],y=iris_data.loc[y==c,"sepal width (cm)"],alpha=0.8,label=c,marker=marker[c])
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend()
plt.show()

Insert picture description here

index = c = 2
plt.scatter(x=iris_data.loc[y==c,"sepal length (cm)"],y=iris_data.loc[y==c,"sepal width (cm)"],alpha=0.8,label=c,marker=marker[c])
plt.xlabel("sepal length (cm)")
plt.ylabel("sepal width (cm)")
plt.legend()
plt.show()

Insert picture description here
Draw a conclusion: when the x number and the circle overlap, it will become a square.

More specifically, we can use the pandas.plotting.scatter_matrix method to generate a matrix of pairwise scatter plots of all features:

from sklearn.model_selection import train_test_split
import mglearn
X_train,X_test,y_train,y_test = train_test_split(iris['data'],iris['target'],test_size=0.2,random_state=0)
iris_dataframe=pd.DataFrame(X_train,columns=iris.feature_names)
grr=pd.plotting.scatter_matrix(iris_dataframe,c=y_train,figsize=(15,15),marker='o',hist_kwds={
    
    'bins':20},s=60,alpha=.8,cmap=mglearn.cm3)

Insert picture description here
Next, let us use the Iris data set to build a predictive model:

X_train, X_test, y_train, y_test = train_test_split(
iris['data'], iris['target'], test_size=0.3,random_state=0)
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
result = knn.predict(X_new)
Test set score: 0.98

Test the flower data in the fifth row:

X_new = np.array([[5,3.6,1.4,0.2]])
result = knn.predict(X_new)
print('Prediction:{}'.format(result))
print('Predicted target name:{}'.format(iris_dataset['target_names'][result]))
Prediction:[0]
Predicted target name:['setosa']

4. Cases of unsupervised learning

# 生成月牙型非凸集
x, y = datasets.make_moons(n_samples=2000, shuffle=True,
                  noise=0.05, random_state=None)
for index,c in enumerate(np.unique(y)):
    plt.scatter(x[y==c,0],x[y==c,1],s=7)
plt.show()

Insert picture description here
Among them, the function of the for loop is to draw the upper half circle first, and then the lower half circle. We can also use another method without using the loop:

plt.scatter(x[:,0],x[:,1],c=y,s=7)
plt.show()

Insert picture description here

# 生成符合正态分布的聚类数据
x, y = datasets.make_blobs(n_samples=5000, n_features=2, centers=3)
for index,c in enumerate(np.unique(y)):
    plt.scatter(x[y==c, 0], x[y==c, 1],s=7)
plt.show()

Insert picture description here
For this kind of clustered data, you can also skip the loop and draw directly:

plt.scatter(x[:, 0], x[:, 1],s=7,c=y)
plt.show()

Insert picture description here

Digression

Why do you feel that this jupyter notebook is selling cute to me? ? ?

Insert picture description here

references

[1] Zhou Zhihua, Machine Learning, Tsinghua University Press, 2016
[2] Li Hang, Statistical Learning Methods, Tsinghua University Press, 2012
[3] PYthon-detailed explanation of various parameters of plt.scatter
[4] Boston housing price prediction- Regression analysis case (dedicated to beginners)
[5] Introduction to machine learning: starting from Iris data classification

The three main tasks of machine learning

1. Introduction to machine learning

1. Development history of machine learning

2. Machine learning task classification

2. Demonstration of machine learning example code

1. Preparation

2. Return

1) Overview of regression analysis

2) Return case

3. Classification case

4. Cases of unsupervised learning

Digression

references

Guess you like