Introduction to Machine Learning / k-Nearest Neighbors (k-NN)

I. Introduction to Machine Learning

(A) What Is Machine Learning

Machine learning turns unordered data into useful information.

(B) Terminology

The features (attributes) of a training set are usually its columns; each feature is the result of an independent measurement, and several features linked together form a training sample. The main task of machine learning is classification. Usually we feed the algorithm a large amount of classified data, called the training set. To test the effect of a machine learning algorithm, two independent sample sets are typically used: training data and test data.

(C) Tasks of Machine Learning

Another machine learning task is regression, which is mainly used to predict numeric data. Classification and regression both belong to supervised learning, so called because these algorithms must be told what to predict, namely the classification information of the target variable. The counterpart of supervised learning is unsupervised learning, where the data carry no class labels and no definite target is given. In unsupervised learning, the process of grouping a data set into multiple classes of similar objects is called clustering; the process of finding statistical values that describe the data is called density estimation.

(D) How to Choose an Algorithm

First consider the purpose of using a machine learning algorithm. If you want to predict the value of a target variable, choose a supervised learning algorithm; otherwise choose an unsupervised one. Having settled on supervised learning, you then need to determine the type of the target variable: if it takes discrete values such as yes/no, 1/2/3, A/B/C, or red/yellow/black, choose a classification algorithm; if it takes continuous values such as 0.0 to 100.0 or -999 to +999, choose a regression algorithm. If you do not want to predict the value of a target variable, choose an unsupervised learning algorithm, then analyze whether the data need to be divided into discrete groups. If that is the only requirement, use a clustering algorithm; if you also need to estimate how similar the data are to each group, use a density estimation algorithm.
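This selection flow can be written down as a small helper; the following is a sketch of our own for illustration (choose_algorithm is not a library function):

def choose_algorithm(predict_target, target_is_discrete=False, need_similarity=False):
    # supervised learning: discrete target -> classification, continuous -> regression
    if predict_target:
        return "classification" if target_is_discrete else "regression"
    # unsupervised learning: grouping only -> clustering, similarity estimates -> density estimation
    return "density estimation" if need_similarity else "clustering"

print(choose_algorithm(predict_target=True))  # -> regression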

Next, consider the data. The main characteristics to understand are: whether the feature values are discrete or continuous; whether any values are missing and, if so, what caused the gaps; whether the data contain outliers; how frequently a feature occurs (is it as rare as a needle in a haystack); and so on. Fully understanding these characteristics can shorten the time needed to select a machine learning algorithm.
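With pandas these characteristics can be checked in a few lines; a minimal sketch, assuming the data sit in a CSV file like the one used below:

import pandas as pd

data = pd.read_csv("data/Advertising.csv")   # any tabular data set
print(data.dtypes)            # discrete vs. continuous feature types
print(data.isnull().sum())    # missing values per column
print(data.describe())        # ranges and spread, a first look for outliers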

(E) Development Steps

When developing an application with a machine learning algorithm, we typically follow these steps: (1) collect data; (2) prepare the input data; (3) analyze the input data; (4) train the algorithm; (5) test the algorithm; (6) use the algorithm.

(F) In Practice: Forecasting Sales from TV Advertising Spend

1. Collect and prepare data

Your company advertises its product on television and has collected data on TV advertising spend x (in millions of yuan) and product sales y (in hundreds of millions of units). As the company's data scientist, you hope that by analyzing these data you can learn the relationship between TV ad spend x and product sales y. Assume the relationship between x and y is linear, that is, y = ax + b. Through linear regression (Linear Regression) we can learn the values of a and b. Then, when planning future campaigns, we can input the TV advertising spend x and predict the product sales y, allowing production, logistics, and warehousing to plan ahead and provide customers with better service.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

data = pd.read_csv("data/Advertising.csv")

2. Data analysis

data.head()

data.columns

data.info()

 

RangeIndex: 200 entries, 0 to 199
Data columns (total 2 columns):
TV       200 non-null float64
sales    200 non-null float64
dtypes: float64(2)

TV is a feature and takes continuous values. sales is the label, also continuous, so we select a regression algorithm. Drawing a scatter plot, we observe that the data fall roughly along a line, so we choose linear regression.

plt.figure(figsize=(16, 8))
plt.scatter(data['TV'], data['sales'], c='black')
plt.xlabel("Money spent on TV ads")
plt.ylabel("Sales")
plt.show()

3. Train and test the algorithm

X = data['TV'].values.reshape(-1,1)
y = data['sales'].values.reshape(-1,1)

reg = LinearRegression()
reg.fit(X, y)

print('a = {:.5}'.format(reg.coef_[0][0]))      # slope a
print('b = {:.5}'.format(reg.intercept_[0]))    # intercept b

print("Linear model: Y = {:.5}X + {:.5} ".format(reg.coef_[0][0], reg.intercept_[0]))
a = 0.047537
b = 7.0326
Linear model: Y = 0.047537X + 7.0326

 

predictions = reg.predict(X)

plt.figure(figsize=(16, 8))
plt.scatter(data['TV'], data['sales'], c='black')
plt.plot(data['TV'], predictions, c='red', linewidth=2)
plt.xlabel("Money spent on TV ads")
plt.ylabel("Sales")
plt.show()

4. Use the algorithm

predictions = reg.predict([[100]])
print('With 100 million yuan of TV advertising, the predicted sales volume is {:.5} hundred million units.'.format( predictions[0][0]) )

With 100 million yuan of TV advertising, the predicted sales volume is 11.786 hundred million (1.1786 billion) units.
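This agrees with the fitted line computed by hand: y = 0.047537 × 100 + 7.0326 = 11.786.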

(G) Exercise: Predicting the Relationship Between Altitude and Temperature

Temperature falls as altitude rises, and we can learn the relationship between altitude and temperature by measuring the temperature at different altitudes. Assume the relationship can be expressed by the equation y (temperature) = a * x (altitude) + b. In theory, measuring at just two different altitudes is enough to compute the values of a and b. However, since every instrument has measurement error, using readings from more altitudes makes the prediction more accurate. We provide temperature readings at nine different altitudes; use the linear regression method learned today to estimate a and b. According to this formula, what do we expect the temperature to be at an altitude of about 8,000 meters?
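As a check on the two-point claim: given readings (x1, y1) and (x2, y2) at two different altitudes, a = (y2 − y1) / (x2 − x1) and b = y1 − a * x1. With nine noisy readings, linear regression instead chooses the a and b that minimize the total squared error.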

1. Collect and prepare data

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

data = pd.read_csv("exercise/height.vs.temperature.csv")

2. Data analysis

data.info()

Data columns (total 2 columns):
height         9 non-null float64
temperature    9 non-null float64
dtypes: float64(2)

plt.figure(figsize=(16,8))
plt.scatter(data['height'],data['temperature'],c='black')
plt.xlabel("Height")
plt.ylabel("Temperature")
plt.show()

3. Train and test the algorithm

X = data['height'].values        # 1-D array, shape (9,)
X

X = data['height'].values.reshape(-1,1)   # reshaped to a column vector (9, 1), as sklearn expects
X

y = data['temperature'].values   # 1-D array, shape (9,)
y

y = data['temperature'].values.reshape(-1,1)   # column vector (9, 1)
y

reg = LinearRegression()
reg.fit(X,y)

print('a={:.5}'.format(reg.coef_[0][0]))
print('b={:.5}'.format(reg.intercept_[0]))

predictions = reg.predict(X)
plt.figure(figsize=(16,8))
plt.scatter(data['height'],data['temperature'],c='black')
plt.plot(data['height'],predictions,c='b',linewidth=2)
plt.xlabel("Height")
plt.ylabel("Temperature")
plt.show()

 

 

4. Use the algorithm

predictions = reg.predict([[8000]])
print('At an altitude of 8000 meters, the temperature will be {:.5}.'.format( predictions[0][0]))

At an altitude of 8,000 meters, the temperature will be about -39.838.
 

II. k-Nearest Neighbors (k-NN)

(A) Algorithm Overview

The k-nearest-neighbors algorithm classifies by measuring the distances between the feature values of different samples. There is a sample data set, also called the training set, in which a label exists for every sample; that is, we know the correspondence between each sample and the category it belongs to. When new data without a label are entered, each feature of the new data is compared with the corresponding feature of every sample in the set, and the algorithm extracts the class labels of the most similar samples (the nearest neighbors). In general we use only the k most similar samples in the data set, which is where the k in k-nearest neighbors comes from; k is usually an integer no greater than 20. Finally, the most frequent class among the k most similar samples is taken as the class of the new data.
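To make the distance-and-vote procedure concrete, here is a minimal sketch in plain NumPy (the function knn_classify and the toy data are our own illustration):

import numpy as np

def knn_classify(x_new, X_train, y_train, k=3):
    # Euclidean distance from the new sample to every training sample
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training samples
    nearest = distances.argsort()[:k]
    # majority vote over the labels of those k neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[counts.argmax()]

# toy example: two features, two classes
X_train = np.array([[1.0, 1.1], [1.0, 1.0], [0.0, 0.0], [0.0, 0.1]])
y_train = np.array(['A', 'A', 'B', 'B'])
print(knn_classify(np.array([0.2, 0.1]), X_train, y_train, k=3))  # -> B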

The general procedure for the k-nearest-neighbors algorithm: (1) Collect data: any method may be used. (2) Prepare data: compute the values needed for distance calculations, preferably in a structured data format. (3) Analyze data: any method may be used. (4) Train the algorithm: this step does not apply to k-nearest neighbors. (5) Test the algorithm: compute the error rate. (6) Use the algorithm: first input the sample data and the structured output, then run the k-nearest-neighbors algorithm to determine which class the input data belong to, and finally apply whatever follow-up processing the computed classification calls for.

(B) In Practice: Predicting Used-Car Market Prices with k-Nearest Neighbors

1. Collect and prepare data

 

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data.csv')


2. Data analysis

data.head()
data.info()


RangeIndex: 13 entries, 0 to 12
Data columns (total 8 columns):
Brand                13 non-null object
Type                 13 non-null float64
Color                13 non-null object
Construction Year    13 non-null int64
Odometer             13 non-null int64
Ask Price            13 non-null int64
Days Until MOT       13 non-null int64
HP                   13 non-null int64
dtypes: float64(1), int64(5), object(2)

Brand, Type, Color, Construction Year, Odometer, Days Until MOT, and HP are the features; Brand, Type, and Color take discrete values, while Construction Year, Odometer, Days Until MOT, and HP take continuous values. Ask Price is the label and is continuous, so we select a regression algorithm.

y = data['Ask Price']
y
X = data.drop(['Ask Price'],axis=1)
X.columns


Inspecting the data, the Brand column holds the same value in every row, so it can be dropped. Type and Color are discrete values. In machine learning, most algorithms, such as logistic regression, support vector machines (SVM), and k-nearest neighbors, can only handle numeric data, not text. In that case, to make the data fit the algorithm and the library, the data must be encoded, converting textual values into numeric ones. Type and Color are nominal variables, meaning the categories are mutually independent with no relationship between them, so encoding them as dummy variables (one-hot encoding) conveys the most accurate information to the algorithm.

Construction Year, Odometer, Days Until MOT, and HP are continuous values. In distance-based models such as k-nearest neighbors and K-Means clustering, putting features on a common scale can improve model accuracy and prevents a feature with an especially large value range from dominating the distance calculation. We therefore standardize the Construction Year, Odometer, Days Until MOT, and HP columns.

from sklearn.preprocessing import OneHotEncoder

X1 = X.iloc[:,1:3]
X1.head()

ohe = OneHotEncoder()
X2 = ohe.fit_transform(X1).toarray()
X2

ohe.get_feature_names()

from sklearn.preprocessing import StandardScaler

X

X3 = X.iloc[:,3:]
X3.columns

scaler = StandardScaler()
X4 = scaler.fit_transform(X3)

X6 = pd.concat([pd.DataFrame(X2),pd.DataFrame(X4)],axis=1)
X6

X6.columns = [ 'x0_1.0', 'x0_1.1', 'x0_1.4', 'x1_black', 'x1_blue', 'x1_green',
       'x1_grey', 'x1_red', 'x1_white','Construction Year','Odometer', 'Days Until MOT','HP']
X6


The original data had 8 features; dropping the Brand feature and adding the features generated by the one-hot encoding above, the new data have 13 features. To make the algorithm run faster and perform better, we use PCA (principal component analysis) to reduce the dimensionality of the data, i.e. to reduce the number of features in the feature matrix.

from sklearn.decomposition import PCA

X6.shape

# apply PCA
pca = PCA(n_components=2)   # instantiate, keeping 2 components
pca = pca.fit(X6)           # fit the model
X_dr = pca.transform(X6)    # obtain the new, reduced matrix
X_dr

pca.explained_variance_
pca.explained_variance_ratio_
pca.explained_variance_ratio_.sum()

pca_line = PCA().fit(X6)
plt.plot(list(range(13)),np.cumsum(pca_line.explained_variance_ratio_))
plt.xticks(list(range(13)))  # show integer ticks on the x axis
plt.xlabel("number of components after dimension reduction")
plt.ylabel("cumulative explained variance ratio")
plt.show()

pca = PCA(n_components=4)
X_dr = pca.fit_transform(X6)
X_dr


3. Train and test the algorithm

from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X6,y,random_state=0)

X_train

X_test

y_train

knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X_train,y_train)

result = knn.predict(X_test)

result

y_test
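With its default uniform weights, KNeighborsRegressor simply averages the Ask Price of the k nearest training samples. A manual check of the first test sample (our own sketch; it should reproduce knn.predict(X_test)[0]):

x0 = X_test.values[0]
# Euclidean distance from the first test sample to every training sample
dists = np.sqrt(((X_train.values - x0) ** 2).sum(axis=1))
nearest2 = dists.argsort()[:2]
print(y_train.values[nearest2].mean())   # mean label of the 2 nearest neighbors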

plt.scatter(result, y_test)
# red diagonal = perfect agreement between predicted and true asking prices
diagonal = np.linspace(500, 1500, 100)
plt.plot(diagonal, diagonal, '-r')
plt.xlabel('Predicted ask price')
plt.ylabel('Ask price')
plt.show()


For regression algorithms, there are two different angles from which to judge how well a regression works. First, did we predict the correct values? The mean squared error is essentially the RSS divided by the total number of samples, giving the average error per sample: MSE = (1/m) Σ (y_i − ŷ_i)². With this average error in hand, we can compare it against the value range of the labels to obtain a fairly reliable basis for evaluation.

Second, did we fit enough of the information? For regression algorithms, checking only whether the predictions are numerically accurate is not enough. Beyond the numeric values themselves, we also want the model to capture the "regularities" of the data, such as its distribution and monotonicity, and whether that information has been captured cannot be measured with MSE. Suppose the first half of a curve is fitted very well, with the true labels and our predictions almost coinciding, while the second half is fitted very badly, with the model heading in the direction opposite to the true labels. If we judged such a model by MSE alone, the MSE would be small: most samples are fitted almost perfectly, and the huge gaps between true and predicted values on the few remaining samples get averaged over all samples. Yet such a fit is certainly not a good result, because any new sample that falls in the second half of the curve will be predicted with a huge error, which is not what we want. So besides judging whether the predicted values are correct, we also want to judge whether the model has fitted enough information beyond the values themselves. To measure how much of the information in the data the model captures, we define R² = 1 − RSS/TSS, where TSS is the total sum of squared differences from the label mean. The essence of variance is the difference between each value and the sample mean; the larger that difference, the more information the values carry. The closer R² is to 1, the better.
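As a cross-check on these definitions, both metrics can be computed by hand with NumPy before calling sklearn (a sketch; the function names are ours):

import numpy as np

def mse(y_true, y_pred):
    # mean squared error: RSS averaged over the m samples
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    # R^2 = 1 - RSS / TSS: the share of label variance the model captures
    rss = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - rss / tss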

from sklearn.metrics import mean_squared_error, r2_score

# mean squared error
print("Mean squared error: %.2f" % mean_squared_error(y_test, result))
# R^2 score: 1 means perfect prediction
print('Variance score: %.2f' % r2_score(y_test, result))


Mean squared error: 54294.12
Variance score: 0.66

Judging from the mean squared error and the R² score, the model's predictions are not very good.
