## 1. Linear regression

### 1. Cost function

• The cost function is: $J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$
• where: $h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n$
• The goal is to find the theta that minimizes the cost, which means the fitted equation is as close as possible to the true values
• There are m data samples; each term is the squared distance between the fitted value and the true value. The difference is squared because it may be negative, and positive and negative differences would otherwise cancel out
• The coefficient `2` in the denominator is there because the gradient below takes the partial derivative with respect to each variable, and the `2` produced by differentiating the square cancels it
• Implementation code:
```python
# Compute the cost function
def computerCost(X, y, theta):
    m = len(y)
    J = 0

    J = (np.transpose(X*theta-y))*(X*theta-y)/(2*m)  # compute the cost J (X, y, theta are np.matrix here, so * is a matrix product)
    return J
```
• Note that X here has a column of 1s prepended to the real data, because of the intercept term theta(0)

### 2. Gradient descent algorithm

• The partial derivative of the cost function is: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
• So the update to theta can be written as: $\theta_j := \theta_j - \alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$
• where $\alpha$ is the learning rate, which controls the speed of gradient descent; typical values are 0.01, 0.03, 0.1, 0.3...
• Why gradient descent gradually reduces the cost function:
• Consider a function `f(x)`
• Taylor expansion: `f(x+△x)=f(x)+f'(x)*△x+o(△x)`
• Let `△x=-α*f'(x)`, that is, a step in the negative gradient direction scaled by a small step size `α`
• Substituting `△x` into the Taylor expansion: `f(x+△x)=f(x)-α*[f'(x)]²+o(△x)`
• Since `α` is a small positive number and `[f'(x)]²` is non-negative, it follows that, for small enough `α`: `f(x+△x)<=f(x)`
• So moving along the negative gradient decreases the function, and the multi-dimensional case is the same.
• Implementation code
```python
# Gradient descent algorithm
def gradientDescent(X, y, theta, alpha, num_iters):
    m = len(y)
    n = len(theta)

    temp = np.matrix(np.zeros((n, num_iters)))  # stores the theta computed at each iteration, as a matrix

    J_history = np.zeros((num_iters, 1))  # records the cost at each iteration

    for i in range(num_iters):  # loop over the iterations
        h = np.dot(X, theta)    # hypothesis (inner product; matrices can be multiplied directly)
        temp[:, i] = theta - ((alpha/m)*(np.dot(np.transpose(X), h-y)))  # gradient step
        theta = temp[:, i]
        J_history[i] = computerCost(X, y, theta)  # evaluate the cost function
        print('.', end='')
    return theta, J_history
```
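As a minimal sketch of the argument above (not the author's code), the snippet below runs gradient descent with plain NumPy arrays on a tiny synthetic data set and records the cost, which shrinks at every step when `alpha` is small enough. The data, `alpha=0.1`, and the helper names `cost` and `gradient_descent` are assumptions made for illustration.

```python
import numpy as np

def cost(X, y, theta):
    """Squared-error cost J = sum(err^2) / (2m) for 1-D y and theta."""
    err = X @ theta - y
    return err @ err / (2 * len(y))

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent; returns the final theta and the cost history."""
    history = []
    for _ in range(num_iters):
        grad = X.T @ (X @ theta - y) / len(y)  # partial derivatives of J
        theta = theta - alpha * grad           # step along the negative gradient
        history.append(cost(X, y, theta))
    return theta, history

# Hypothetical data generated from y = 1 + 2x, with a column of 1s prepended
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack((np.ones_like(x), x))
y = 1.0 + 2.0 * x

theta, history = gradient_descent(X, y, np.zeros(2), alpha=0.1, num_iters=2000)
print(theta)         # close to [1, 2]
print(history[:3])   # the cost decreases from one iteration to the next
```

If `alpha` is made too large, the `o(△x)` term is no longer negligible and the cost can increase instead, which is why small learning rates are tried first.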

### 3. Mean normalization

• The purpose is to scale the features into a similar range, which helps the gradient descent algorithm converge
• $x_i := \frac{x_i - \mu_i}{s_i}$, where $\mu_i$ is the mean of all the data for this feature
• $s_i$ can be the range (maximum minus minimum) or the standard deviation of the data for this feature
• Implementation code:
```python
# Normalize the features
def featureNormaliza(X):
    X_norm = np.array(X, dtype=float)  # convert X to a float numpy array so that matrix operations work (and integer input is not truncated)
    # define the required variables
    mu = np.zeros((1, X.shape[1]))
    sigma = np.zeros((1, X.shape[1]))

    mu = np.mean(X_norm, 0)     # mean of each column (0 = per column, 1 = per row)
    sigma = np.std(X_norm, 0)   # standard deviation of each column
    for i in range(X.shape[1]):  # loop over the columns
        X_norm[:, i] = (X_norm[:, i]-mu[i])/sigma[i]  # normalize

    return X_norm, mu, sigma
```
• Note that data used for prediction must be normalized in the same way, with the same mean and standard deviation as the training data
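A short sketch of that last point, assuming a made-up training matrix (the numbers are illustrative, apart from the sample `[1650, 3]` that the scikit-learn section also uses): the `mu` and `sigma` computed on the training data are stored and reused for any new sample.

```python
import numpy as np

# Hypothetical training data: two features on very different scales
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0]])

mu = X.mean(axis=0)        # per-feature mean
sigma = X.std(axis=0)      # per-feature standard deviation
X_norm = (X - mu) / sigma  # normalized training data

# A new sample is scaled with the training statistics, not its own
x_new = np.array([1650.0, 3.0])
x_new_norm = (x_new - mu) / sigma
print(x_new_norm)
```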

### 4. The final running result

• Variation of cost with number of iterations

### 5. Implemented using the linear model in the scikit-learn library

• Import packages
```python
import numpy as np
from sklearn import linear_model
from sklearn.preprocessing import StandardScaler  # feature scaling
```
• Normalize
```python
# normalization
scaler = StandardScaler()
scaler.fit(X)
x_train = scaler.transform(X)
x_test = scaler.transform(np.array([[1650, 3]]))  # transform expects a 2-D array of shape (n_samples, n_features)
```
• Fit the linear model
```python
# fit the linear model
model = linear_model.LinearRegression()
model.fit(x_train, y)
```
• Predict
```python
# prediction
result = model.predict(x_test)
```

## 2. Logistic regression

### 1. Cost function

• The two cases of the cost ($-\log(h_\theta(x))$ when `y=1` and $-\log(1-h_\theta(x))$ when `y=0`) can be combined as: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]$

where: $h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$
• Why not use the cost function of linear regression? Because with the sigmoid hypothesis the squared-error cost may be non-convex, so for classification problems gradient descent may not reach the global minimum. The cost function above is convex
• The curve for `y=1` is as follows:

It can be seen that when $h_\theta(x)$ tends to `1` and `y=1`, the prediction agrees with the label and the `cost` tends to `0`; if $h_\theta(x)$ tends to `0` while `y=1`, the `cost` becomes very large. Our ultimate goal is to minimize the cost

• Similarly, the curve for `y=0` is as follows:
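Since the two curves did not survive as images, here is a small numeric sketch of the same behavior. The helper name `cost` and the sample values of `h` are assumptions chosen for illustration.

```python
import math

def cost(h, y):
    """Per-sample logistic cost: -log(h) if y == 1, else -log(1 - h)."""
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

for h in (0.01, 0.5, 0.99):
    # For y=1 the cost falls as h approaches 1; for y=0 it grows instead
    print(h, round(cost(h, 1), 4), round(cost(h, 0), 4))
```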

### 2. Gradient

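• Taking the partial derivative of the cost function gives: $\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)}$

It has the same form as the partial derivative for linear regression (only $h_\theta$ is different)
• Derivation process: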
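The derivation figure is not available here, so the following is a sketch of the standard derivation for a single sample, writing $h = h_\theta(x) = g(\theta^Tx)$ and using the sigmoid identity $g'(z) = g(z)(1-g(z))$, so that $\frac{\partial h}{\partial \theta_j} = h(1-h)x_j$:

```latex
$$
\begin{aligned}
\frac{\partial}{\partial \theta_j}\Big[-y\log h - (1-y)\log(1-h)\Big]
  &= \left(-\frac{y}{h} + \frac{1-y}{1-h}\right)\frac{\partial h}{\partial \theta_j} \\
  &= \left(-\frac{y}{h} + \frac{1-y}{1-h}\right) h(1-h)\,x_j \\
  &= \big(-y(1-h) + (1-y)h\big)\,x_j \\
  &= (h - y)\,x_j
\end{aligned}
$$
```

Averaging this over the `m` samples gives the partial derivative of the cost function.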

### 3. Regularization

• The purpose is to prevent overfitting
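• Add a term to the cost function: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
• Note that j starts at 1: theta(0) is the constant term, which multiplies the column of 1s added to X rather than a real feature, so it does not need to be regularized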
• Regularized cost:
```python
# Cost function
def costFunction(initial_theta, X, y, inital_lambda):
    m = len(y)
    J = 0

    h = sigmoid(np.dot(X, initial_theta))  # compute h(z)
    theta1 = initial_theta.copy()          # regularization starts at j=1 and excludes j=0, so take a copy and set theta(0) to 0
    theta1[0] = 0

    temp = np.dot(np.transpose(theta1), theta1)
    J = (-np.dot(np.transpose(y), np.log(h))-np.dot(np.transpose(1-y), np.log(1-h))+temp*inital_lambda/2)/m  # regularized cost
    return J
```
• The gradient of the regularized cost
```python
# Compute the gradient
def gradient(initial_theta, X, y, inital_lambda):
    m = len(y)
    grad = np.zeros((initial_theta.shape[0]))

    h = sigmoid(np.dot(X, initial_theta))  # compute h(z)
    theta1 = initial_theta.copy()
    theta1[0] = 0

    grad = np.dot(np.transpose(X), h-y)/m+inital_lambda/m*theta1  # regularized gradient
    return grad
```

### 4. Sigmoid function (i.e. $g(z) = \frac{1}{1+e^{-z}}$)

• Implementation code:
```python
# Sigmoid function
def sigmoid(z):
    h = np.zeros((len(z), 1))  # initialize with the same length as z

    h = 1.0/(1.0+np.exp(-z))
    return h
```

### 5. Mapping to polynomial

• Because the data may have few features, which leads to high bias, some combinations of features are created
• e.g. mapped to degree 2: `1, x1, x2, x1^2, x1*x2, x2^2`
• Implementation code:
```python
# Map to polynomial features
def mapFeature(X1, X2):
    degree = 3                       # highest degree of the mapping
    out = np.ones((X1.shape[0], 1))  # result array after mapping (replaces X)
    '''
    Taking degree=2 as an example, the mapping is 1, x1, x2, x1^2, x1*x2, x2^2
    '''
    for i in np.arange(1, degree+1):
        for j in range(i+1):
            temp = X1**(i-j)*(X2**j)  # * on arrays is element-wise, like .* in MATLAB
            out = np.hstack((out, temp.reshape(-1, 1)))
    return out
```

### 6. Optimization method used from `scipy`

• The optimization uses the `fmin_bfgs` function in `scipy`'s `optimize` module
• Call the optimization algorithm fmin_bfgs in scipy (the quasi-Newton Broyden-Fletcher-Goldfarb-Shanno method)
• costFunction is the cost function implemented above
• initial_theta is the initial value
• fprime specifies the gradient of costFunction
• args holds the remaining parameters, passed in as a tuple; the call returns the theta that minimizes costFunction
```python
from scipy import optimize

result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X, y, initial_lambda))
```

### 7. Running results

• data1 decision boundary and accuracy

• data2 decision boundaries and accuracy

### 8. Implementation using the logistic regression model in the scikit-learn library

• Import packages
```python
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import numpy as np
```
• Split into training set and test set
```python
# split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
• Normalize
```python
# normalization
scaler = StandardScaler()
scaler.fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)  # reuse the training-set statistics instead of refitting on the test set
```
• Logistic regression
```python
# logistic regression
model = LogisticRegression()
model.fit(x_train, y_train)
```
• Predict
```python
# prediction
predict = model.predict(x_test)
right = sum(predict == y_test)

predict = np.hstack((predict.reshape(-1, 1), y_test.reshape(-1, 1)))  # put predictions and true values side by side for inspection
print(predict)
print('Test set accuracy: %f%%' % (right*100.0/predict.shape[0]))  # accuracy on the test set
```

## Logistic Regression_Handwritten Digit Recognition_OneVsAll

### 1. Randomly display 100 numbers

• The data set in scikit-learn is not used here; each image is 20*20 px. The color image is as follows:

Grayscale image:
• Implementation code:
```python
# Display 100 digits
def display_data(imgData):
    sum = 0
    '''
    Display 100 digits (drawing them one by one would be very slow; instead arrange
    the digits into one matrix and display that matrix)
    - initialize a 2-D array
    - reshape each row of data into an image matrix and place it in the 2-D array
    - display it
    '''
    pad = 1
    display_array = -np.ones((pad+10*(20+pad), pad+10*(20+pad)))
    for i in range(10):
        for j in range(10):
            display_array[pad+i*(20+pad):pad+i*(20+pad)+20, pad+j*(20+pad):pad+j*(20+pad)+20] = (imgData[sum, :].reshape(20, 20, order="F"))  # order="F" is column-major, as in MATLAB; Python defaults to row-major
            sum += 1

    plt.imshow(display_array, cmap='gray')  # show as a grayscale image
    plt.axis('off')
    plt.show()
```

### 2. OneVsAll

• To solve a multi-class problem with logistic regression, OneVsAll treats one class as the positive class and all other classes as a single negative class, which turns it into a binary classification problem
• As shown in the figure below, to divide the data in the figure into three classes, first treat the red points as one class and everything else as the other and run logistic regression; then treat the blue points as one class and everything else as the other; and so on…
• It can be seen that with more than 2 classes, logistic regression has to be run once per class

### 3. Handwritten digit recognition

• There are 10 digits, 0-9, so 10 classifiers are needed
• Since the labels y in the data set are the digits `0,1,2...9`, while logistic regression needs `0/1` labels, y has to be processed
• About the data set: the first `500` samples are `0`, samples `500-1000` are `1`, `...`. So, as shown in the figure below, after processing `y`, the first column is 1 for the first 500 rows and 0 elsewhere; the second column is 1 for rows 500-1000 and 0 elsewhere; and so on
• Then call the optimization algorithm to solve for `theta`
• Implementation code:
```python
# Solve for the theta of each class and return them all as all_theta
def oneVsAll(X, y, num_labels, Lambda):
    # initialize variables
    m, n = X.shape
    all_theta = np.zeros((n+1, num_labels))  # each column holds the theta of one class, 10 columns in total
    X = np.hstack((np.ones((m, 1)), X))      # prepend a column of 1s to X as the bias
    class_y = np.zeros((m, num_labels))      # y takes values 0-9 and must be mapped to 0/1 labels
    initial_theta = np.zeros((n+1, 1))       # initial theta for one class

    # map y
    for i in range(num_labels):
        class_y[:, i] = np.int32(y == i).reshape(1, -1)  # reshape(1,-1) is needed for the assignment

    # np.savetxt("class_y.csv", class_y[0:600,:], delimiter=',')

    '''loop over the classes and compute the corresponding theta'''
    for i in range(num_labels):
        result = optimize.fmin_bfgs(costFunction, initial_theta, fprime=gradient, args=(X, class_y[:, i], Lambda))  # call the optimization method
        all_theta[:, i] = result.reshape(1, -1)  # store in all_theta

    all_theta = np.transpose(all_theta)
    return all_theta
```

### 4. Prediction

• As mentioned before, the prediction is a probability: the learned `theta` is substituted into the sigmoid function. The maximum value of each row is the highest probability among the digits, and its column index is the predicted digit, because during classification all samples with `y` equal to `0` were mapped to the first column, those equal to 1 to the second column, and so on
• Implementation code:
```python
# Prediction
def predict_oneVsAll(all_theta, X):
    m = X.shape[0]
    num_labels = all_theta.shape[0]
    p = np.zeros((m, 1))
    X = np.hstack((np.ones((m, 1)), X))  # prepend a column of 1s to X

    h = sigmoid(np.dot(X, np.transpose(all_theta)))  # predict

    '''
    - np.max(h, axis=1) returns the maximum of each row of h (the highest probability among the digits)
    - where then finds the column index of that maximum (the column index is the predicted digit)
    '''
    p = np.array(np.where(h[0, :] == np.max(h, axis=1)[0]))
    for i in np.arange(1, m):
        t = np.array(np.where(h[i, :] == np.max(h, axis=1)[i]))
        p = np.vstack((p, t))
    return p
```
### 5. Running results

• 10 classes, accuracy on the training set:

### 6. Implementation using the logistic regression model in the scikit-learn library

• 1. Import packages
```python
from scipy import io as spio
import numpy as np
from sklearn import svm
from sklearn.linear_model import LogisticRegression
```
• 2. Load data
```python
data = loadmat_data("data_digits.mat")
X = data['X']    # X data; each row is one 20x20 px digit
y = data['y']    # y read from the mat file has shape (5000, 1)
y = np.ravel(y)  # sklearn needs a 1-D array of shape (5000,)
```
• 3. Fit the model
```python
model = LogisticRegression()
model.fit(X, y)  # fit
```
• 4. Predict
```python
predict = model.predict(X)  # predict

print("Prediction accuracy: %f%%" % np.mean(np.float64(predict == y)*100))
```
• 5. Output results (accuracy on the training set)

## 3. BP neural network

### 1. Neural network model

• First consider a three-layer neural network, as shown in the figure below
• The input layer has three units ($x_0$ is the added bias, usually set to `1`)
• $a_i^{(j)}$ denotes the activation of the `i`-th unit in layer `j`
• $\Theta^{(j)}$ is the weight matrix mapping layer `j` to layer `j+1`, i.e. the weight of each edge
• So you can get:
• Hidden layer: $a_i^{(2)} = g\left(\sum_{k=0}^{3}\Theta_{ik}^{(1)}x_k\right)$ for $i = 1, 2, 3$

• Output layer: $h_\Theta(x) = a_1^{(3)} = g\left(\sum_{k=0}^{3}\Theta_{1k}^{(2)}a_k^{(2)}\right)$, where the sigmoid function $g$ is also called the activation function
• It can be seen that $\Theta^{(1)}$ is a 3x4 matrix and $\Theta^{(2)}$ is a 1x4 matrix
• The size of $\Theta^{(j)}$ is (number of units in layer `j+1`) x (number of units in layer `j` + 1)
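To make the matrix sizes concrete, here is a minimal forward-pass sketch for the 3-3-1 network described above. The random weights, the input `x`, and the seed are assumptions for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.5, 0.5, size=(3, 4))  # layer 1 (3 units + bias) -> layer 2 (3 units)
Theta2 = rng.uniform(-0.5, 0.5, size=(1, 4))  # layer 2 (3 units + bias) -> layer 3 (1 unit)

x = np.array([0.2, -1.0, 0.5])                       # one input sample
a1 = np.concatenate(([1.0], x))                      # add the bias unit x0 = 1
a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))   # hidden activations plus bias
h = sigmoid(Theta2 @ a2)                             # network output

print(a1.shape, a2.shape, h.shape)
```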

### 2. Cost function

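• Assume that the final output $h_\Theta(x) \in \mathbb{R}^K$, which means there are K units in the output layer
• $J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log\left((h_\Theta(x^{(i)}))_k\right) + (1-y_k^{(i)})\log\left(1-(h_\Theta(x^{(i)}))_k\right)\right]$, where $(h_\Theta(x))_i$ denotes the output of the `i`-th unit
• It is similar to the cost function of logistic regression, except that it sums over every output (K outputs in total)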

### 3. Regularization

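• `L` –> total number of layers
• $s_l$ –> the number of units in layer `l`
• The regularized cost function adds the term $\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\Theta_{ji}^{(l)}\right)^2$ to the cost above
• There are `L-1` weight matrices $\Theta$ in total,
• so sum over the theta matrix of each layer, taking care not to include the theta(0) column that corresponds to the bias term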
• Regularized cost function implementation code:
```python
# Cost function
def nnCostFunction(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda):
    length = nn_params.shape[0]  # total length of theta
    # restore Theta1 and Theta2
    Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size, input_layer_size+1)
    Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels, hidden_layer_size+1)

    # np.savetxt("Theta1.csv",Theta1,delimiter=',')

    m = X.shape[0]
    class_y = np.zeros((m, num_labels))  # y takes values 0-9 and must be mapped to 0/1 labels
    # map y
    for i in range(num_labels):
        class_y[:, i] = np.int32(y == i).reshape(1, -1)  # reshape(1,-1) is needed for the assignment

    '''drop the first column of Theta1 and Theta2, because regularization starts at index 1'''
    Theta1_colCount = Theta1.shape[1]
    Theta1_x = Theta1[:, 1:Theta1_colCount]
    Theta2_colCount = Theta2.shape[1]
    Theta2_x = Theta2[:, 1:Theta2_colCount]
    # regularization term: sum of theta^2
    term = np.dot(np.transpose(np.vstack((Theta1_x.reshape(-1, 1), Theta2_x.reshape(-1, 1)))), np.vstack((Theta1_x.reshape(-1, 1), Theta2_x.reshape(-1, 1))))

    '''forward propagation; a column of 1s (bias) is added at each layer'''
    a1 = np.hstack((np.ones((m, 1)), X))
    z2 = np.dot(a1, np.transpose(Theta1))
    a2 = sigmoid(z2)
    a2 = np.hstack((np.ones((m, 1)), a2))
    z3 = np.dot(a2, np.transpose(Theta2))
    h = sigmoid(z3)
    '''cost'''
    J = -(np.dot(np.transpose(class_y.reshape(-1, 1)), np.log(h.reshape(-1, 1)))+np.dot(np.transpose(1-class_y.reshape(-1, 1)), np.log(1-h.reshape(-1, 1)))-Lambda*term/2)/m

    return np.ravel(J)
```
### 4. Backpropagation BP

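• Forward propagation above computes `J(θ)`; to use gradient descent we also need its gradient
• The purpose of BP (backpropagation) is to compute the gradient of the cost function
• Assume a 4-layer neural network, and let $\delta_j^{(l)}$ be the error of unit `j` in layer `l`
• $\delta^{(4)} = a^{(4)} - y$, $\delta^{(3)} = (\Theta^{(3)})^T\delta^{(4)} \odot g'(z^{(3)})$, $\delta^{(2)} = (\Theta^{(2)})^T\delta^{(3)} \odot g'(z^{(2)})$, where $\odot$ is the element-wise product and $g'(z^{(l)}) = a^{(l)} \odot (1-a^{(l)})$
• There is no $\delta^{(1)}$, because the input has no error
• The process of backpropagation to calculate the gradient is: for every sample, run forward propagation, compute the $\delta^{(l)}$, and accumulate $\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$; then $D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)} + \frac{\lambda}{m}\Theta_{ij}^{(l)}$ for $j \ne 0$ and $D_{ij}^{(l)} = \frac{1}{m}\Delta_{ij}^{(l)}$ for $j = 0$
• Finally $\frac{\partial}{\partial\Theta_{ij}^{(l)}}J(\Theta) = D_{ij}^{(l)}$, which is the gradient of the cost function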
• Implementation code:
```python
# Gradient
def nnGradient(nn_params, input_layer_size, hidden_layer_size, num_labels, X, y, Lambda):
    length = nn_params.shape[0]
    Theta1 = nn_params[0:hidden_layer_size*(input_layer_size+1)].reshape(hidden_layer_size, input_layer_size+1).copy()  # copy, otherwise modifying Theta below would also modify nn_params
    Theta2 = nn_params[hidden_layer_size*(input_layer_size+1):length].reshape(num_labels, hidden_layer_size+1).copy()
    m = X.shape[0]
    class_y = np.zeros((m, num_labels))  # y takes values 0-9 and must be mapped to 0/1 labels
    # map y
    for i in range(num_labels):
        class_y[:, i] = np.int32(y == i).reshape(1, -1)  # reshape(1,-1) is needed for the assignment

    '''drop the first column of Theta1 and Theta2, because regularization starts at index 1'''
    Theta1_colCount = Theta1.shape[1]
    Theta1_x = Theta1[:, 1:Theta1_colCount]
    Theta2_colCount = Theta2.shape[1]
    Theta2_x = Theta2[:, 1:Theta2_colCount]

    Theta1_grad = np.zeros((Theta1.shape))  # gradient of the weights from layer 1 to layer 2
    Theta2_grad = np.zeros((Theta2.shape))  # gradient of the weights from layer 2 to layer 3

    '''forward propagation; a column of 1s (bias) is added at each layer'''
    a1 = np.hstack((np.ones((m, 1)), X))
    z2 = np.dot(a1, np.transpose(Theta1))
    a2 = sigmoid(z2)
    a2 = np.hstack((np.ones((m, 1)), a2))
    z3 = np.dot(a2, np.transpose(Theta2))
    h = sigmoid(z3)

    '''backpropagation; delta is the error'''
    delta3 = np.zeros((m, num_labels))
    delta2 = np.zeros((m, hidden_layer_size))
    for i in range(m):
        # delta3[i,:] = (h[i,:]-class_y[i,:])*sigmoidGradient(z3[i,:])  # error term for the mean squared error
        delta3[i, :] = h[i, :]-class_y[i, :]                            # error term for the cross-entropy cost
        Theta2_grad = Theta2_grad+np.dot(np.transpose(delta3[i, :].reshape(1, -1)), a2[i, :].reshape(1, -1))
        delta2[i, :] = np.dot(delta3[i, :].reshape(1, -1), Theta2_x)*sigmoidGradient(z2[i, :])
        Theta1_grad = Theta1_grad+np.dot(np.transpose(delta2[i, :].reshape(1, -1)), a1[i, :].reshape(1, -1))

    Theta1[:, 0] = 0
    Theta2[:, 0] = 0
    '''gradient'''
    grad = (np.vstack((Theta1_grad.reshape(-1, 1), Theta2_grad.reshape(-1, 1)))+Lambda*np.vstack((Theta1.reshape(-1, 1), Theta2.reshape(-1, 1))))/m
    return np.ravel(grad)
```
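The code above calls `sigmoidGradient`, which is not listed in this section. A plausible sketch, assuming it is the derivative of the sigmoid, $g'(z) = g(z)(1-g(z))$, is shown below together with a finite-difference comparison; the test values of `z` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoidGradient(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z)) (assumed definition)."""
    g = sigmoid(z)
    return g * (1.0 - g)

# Compare with a central finite difference at a few arbitrary points
z = np.array([-2.0, 0.0, 3.0])
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(sigmoidGradient(z))
print(numeric)
```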

### 5. Why BP computes the gradient

• It is in fact the `chain rule` of differentiation
• because the units of each layer are computed using the units of the previous layer as input
• The general derivation is as follows. Ultimately we want the predicted function to be very close to the known `y`, and moving along the gradient of the mean squared error minimizes the cost function. It can be compared with the gradient computation above.
• A more detailed derivation of the error:

### 6. Gradient check

• Check whether the gradient obtained by `BP` is correct
• Verify using the definition of the derivative: $\frac{d}{d\theta}J(\theta) \approx \frac{J(\theta+\epsilon) - J(\theta-\epsilon)}{2\epsilon}$
• The calculated numerical gradient should be very close to the gradient calculated by BP
• After verifying that BP is correct, the gradient-checking code no longer needs to be run
• Implementation code:
```python
# Check whether the gradient is computed correctly
def checkGradient(Lambda=0):
    '''Build a small neural network for the check, because computing the gradient numerically is slow, and once verified the check is no longer needed'''
    input_layer_size = 3
    hidden_layer_size = 5
    num_labels = 3
    m = 5
    initial_Theta1 = debugInitializeWeights(input_layer_size, hidden_layer_size)
    initial_Theta2 = debugInitializeWeights(hidden_layer_size, num_labels)
    X = debugInitializeWeights(input_layer_size-1, m)
    y = 1+np.transpose(np.mod(np.arange(1, m+1), num_labels))  # initialize y

    y = y.reshape(-1, 1)
    nn_params = np.vstack((initial_Theta1.reshape(-1, 1), initial_Theta2.reshape(-1, 1)))  # unroll theta
    '''gradient from BP'''
    grad = nnGradient(nn_params, input_layer_size, hidden_layer_size,
                      num_labels, X, y, Lambda)
    '''gradient from the numerical method'''
    num_grad = np.zeros((nn_params.shape[0]))
    step = np.zeros((nn_params.shape[0]))
    e = 1e-4
    for i in range(nn_params.shape[0]):
        step[i] = e
        loss1 = nnCostFunction(nn_params-step.reshape(-1, 1), input_layer_size, hidden_layer_size,
                               num_labels, X, y,
                               Lambda)
        loss2 = nnCostFunction(nn_params+step.reshape(-1, 1), input_layer_size, hidden_layer_size,
                               num_labels, X, y,
                               Lambda)
        num_grad[i] = (loss2-loss1)/(2*e)
        step[i] = 0
    # show the two columns side by side for comparison
    res = np.hstack((num_grad.reshape(-1, 1), grad.reshape(-1, 1)))
    print(res)
```
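The same central-difference idea can be tried on a function whose gradient is known in closed form. This is a generic sketch, not tied to the network above; the helper `numerical_gradient` and the test function `f` are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(f, theta, e=1e-4):
    """Central-difference estimate of the gradient of f at theta."""
    num_grad = np.zeros_like(theta)
    step = np.zeros_like(theta)
    for i in range(theta.size):
        step[i] = e  # perturb one parameter at a time
        num_grad[i] = (f(theta + step) - f(theta - step)) / (2 * e)
        step[i] = 0.0
    return num_grad

# Hypothetical test function with a known gradient
def f(t):
    return np.sum(t ** 2) + t[0] * t[1]

def analytic_gradient(t):
    return 2 * t + np.array([t[1], t[0], 0.0])

theta = np.array([1.0, -2.0, 0.5])
print(np.column_stack((numerical_gradient(f, theta), analytic_gradient(theta))))
```

The two printed columns should agree to several decimal places; a clear mismatch in some row points at a bug in the analytic gradient for that parameter.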

### 7. Random initialization of weights

• The `theta` of a neural network cannot all be initialized to `0`: if the weight of every edge is 0, every neuron in a layer produces the same output and receives the same gradient in backpropagation, so in the end only one result is predicted.
• So the weights should be initialized to random numbers close to 0
• Implementation code
```python
# Randomly initialize the weights theta
def randInitializeWeights(L_in, L_out):
    W = np.zeros((L_out, 1+L_in))  # weights of the corresponding theta
    epsilon_init = (6.0/(L_out+L_in))**0.5
    W = np.random.rand(L_out, 1+L_in)*2*epsilon_init-epsilon_init  # np.random.rand(L_out,1+L_in) gives a random matrix of size L_out*(1+L_in)
    return W
```

### 8. Prediction

• Run forward propagation to get the prediction
• Implementation code
```python
# Prediction
def predict(Theta1, Theta2, X):
    m = X.shape[0]
    num_labels = Theta2.shape[0]
    # p = np.zeros((m,1))
    '''forward propagation to get the prediction'''
    X = np.hstack((np.ones((m, 1)), X))
    h1 = sigmoid(np.dot(X, np.transpose(Theta1)))
    h1 = np.hstack((np.ones((m, 1)), h1))
    h2 = sigmoid(np.dot(h1, np.transpose(Theta2)))

    '''
    - np.max(h2, axis=1) returns the maximum of each row of h2 (the highest probability among the digits)
    - where then finds the column index of that maximum (the column index is the predicted digit)
    '''
    # np.savetxt("h2.csv",h2,delimiter=',')
    p = np.array(np.where(h2[0, :] == np.max(h2, axis=1)[0]))
    for i in np.arange(1, m):
        t = np.array(np.where(h2[i, :] == np.max(h2, axis=1)[i]))
        p = np.vstack((p, t))
    return p
```
### 9. Output result

• Gradient check:
• Randomly display 100 handwritten numbers
• Show theta1 weights
• Training set prediction accuracy
• Normalized training set prediction accuracy

## 4. SVM support vector machine

### 1. Cost function

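• In logistic regression, our cost is: $-y\log(h_\theta(x)) - (1-y)\log(1-h_\theta(x))$, with $h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}$
• As shown in the figure, for `y=1` the `cost` term is replaced by the piecewise-linear function $cost_1(z)$

• Similarly, for `y=0` it is replaced by $cost_0(z)$
• The final cost function is: $J(\theta) = C\sum_{i=1}^{m}\left[y^{(i)}cost_1(\theta^Tx^{(i)}) + (1-y^{(i)})cost_0(\theta^Tx^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
Finally we want $\min_\theta J(\theta)$
• The cost function in our logistic regression before is: $J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$
`C` here can be regarded as playing the role of $\frac{1}{\lambda}$; it is just a different form of expression. When `C` is very large, the SVM behaves as a large-`margin` classifier, as explained below

### 2. Large Margin

• As shown in the figure below, the SVM separates the classes with the largest `margin`
• First, the vector inner product
• Let `p` be the length of the projection of `vector v` onto `vector u`; then the vector inner product is: $u^Tv = p\,\|u\| = u_1v_1 + u_2v_2$

which follows from the formula for the angle between two vectors, $\cos\theta = \frac{u \cdot v}{\|u\|\,\|v\|}$
• As mentioned earlier, a very large `C` gives a large `margin`. Our goal is to minimize the cost function `J(θ)`. When `C` is very large, the term multiplied by `C` has to be very small (close to 0), so the cost is approximately:
$J(\theta) = \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 = \frac{1}{2}\|\theta\|^2$,
and our final goal is to find the `θ` that minimizes this cost

• From the inner product it follows that:
$\theta^Tx^{(i)} = p^{(i)}\|\theta\|$, where $p^{(i)}$ is the projection of $x^{(i)}$ onto `θ`
• As shown in the figure below, for a given decision boundary, take a point `x` whose projection onto `θ` is `p`. The constraints are $p\,\|\theta\| \ge 1$ (for `y=1`) or $p\,\|\theta\| \le -1$ (for `y=0`). If `p` is small, $\|\theta\|$ has to be very large, which conflicts with minimizing $\|\theta\|$, so the optimization prefers boundaries with large `p`, and the final result is a `large margin`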
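The inner-product identity used above can be checked numerically. This is a small sketch with made-up vectors `u` and `v`; `p` is the signed length of the projection of `v` onto `u`.

```python
import numpy as np

u = np.array([3.0, 4.0])
v = np.array([2.0, 1.0])

inner = u @ v                   # u1*v1 + u2*v2
p = inner / np.linalg.norm(u)   # signed projection length of v onto u

print(inner, p, p * np.linalg.norm(u))
```

With `θ` in place of `u` and a sample `x` in place of `v`, the same relation reads $\theta^Tx = p\,\|\theta\|$.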

### 3. SVM Kernel (kernel function)

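• Given several landmark points $l^{(1)}, l^{(2)}, l^{(3)}$ as in the figure,

let $f_i = similarity(x, l^{(i)}) = \exp\left(-\frac{\|x - l^{(i)}\|^2}{2\sigma^2}\right)$

• The smaller the `σ` of the Gaussian kernel function, the faster `f` falls off

• Minimize `J` to find `θ`, where $J(\theta) = C\sum_{i=1}^{m}\left[y^{(i)}cost_1(\theta^Tf^{(i)}) + (1-y^{(i)})cost_0(\theta^Tf^{(i)})\right] + \frac{1}{2}\sum_{j=1}^{n}\theta_j^2$
• If $\theta^Tf \ge 0$, predict `y=1`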
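A minimal sketch of the Gaussian kernel, assuming the usual form $\exp\left(-\frac{\|x-l\|^2}{2\sigma^2}\right)$; the function name `gaussian_kernel` and the sample points are illustrative.

```python
import numpy as np

def gaussian_kernel(x, l, sigma):
    """Similarity between a point x and a landmark l; equals 1 when x == l."""
    diff = x - l
    return np.exp(-(diff @ diff) / (2.0 * sigma ** 2))

x = np.array([1.0, 2.0])
l = np.array([2.0, 4.0])

for sigma in (0.5, 1.0, 2.0):
    # A smaller sigma makes the similarity drop off faster with distance
    print(sigma, gaussian_kernel(x, l, sigma))
```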

### 4. Using the `SVM` model in `scikit-learn`

• Full code
• Linearly separable data, specifying the `linear` kernel function:
```python
'''data1: linear classification'''
data1 = spio.loadmat('data1.mat')
X = data1['X']
y = data1['y']
y = np.ravel(y)
plot_data(X, y)

model = svm.SVC(C=1.0, kernel='linear').fit(X, y)  # use the linear kernel
```
• Nonlinearly separable data, using the default kernel function `rbf`
```python
'''data2: nonlinear classification'''
data2 = spio.loadmat('data2.mat')
X = data2['X']
y = data2['y']
y = np.ravel(y)
plt = plot_data(X, y)
plt.show()

model = svm.SVC(gamma=100).fit(X, y)  # gamma is the kernel coefficient; larger values fit the training data more closely (and can overfit)
```

### 5. Running results

• Linearly separable decision boundaries:
• Linearly inseparable decision boundaries:

## 5. K-Means clustering algorithm

### 1. Clustering process

• Clustering is unsupervised learning: the labels y are unknown, and the data is divided into K classes

• The K-Means algorithm has two steps

• Step 1: cluster assignment. Randomly select `K` points as centers, compute the distance from every sample to these `K` points, and assign each sample to the nearest one, giving `K` clusters

• Step 2: move the cluster centers. Recompute the center of each cluster, move the centers there, and repeat the steps above.

• As shown below:

• Randomly initialized cluster centers

• Recalculate the cluster centers and move them once

• Cluster centers after `10` steps

• Implementation code for finding the center closest to each sample:

```python
# Find the closest cluster center for each sample
def findClosestCentroids(X, initial_centroids):
    m = X.shape[0]                  # number of samples
    K = initial_centroids.shape[0]  # number of clusters
    dis = np.zeros((m, K))          # squared distance from each point to each of the K centers
    idx = np.zeros((m, 1))          # cluster index of each sample, to be returned

    '''compute the distance from every point to every cluster center'''
    for i in range(m):
        for j in range(K):
            dis[i, j] = np.dot((X[i, :]-initial_centroids[j, :]).reshape(1, -1), (X[i, :]-initial_centroids[j, :]).reshape(-1, 1))

    '''the column index of the minimum of each row of dis is the cluster of that sample
    - np.argmin(dis, axis=1) returns one column index per row (the first on ties)
    - the earlier np.where(dis == row minimum) approach returns extra entries when a
      row has several minima, which shifts the results of the following rows
    '''
    idx = np.argmin(dis, axis=1)
    return idx
```
• Implementation code for computing the cluster centers:
```python
# Compute the cluster centers
def computerCentroids(X, idx, K):
    n = X.shape[1]
    centroids = np.zeros((K, n))
    for i in range(K):
        centroids[i, :] = np.mean(X[np.ravel(idx == i), :], axis=0).reshape(1, -1)  # the index must be 1-D; axis=0 averages each column; idx==i selects the samples of cluster i
    return centroids
```
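The two steps can also be written without explicit loops over the samples. The following is a sketch on made-up data, using NumPy broadcasting to build the distance matrix; the points and the initial centers are assumptions for illustration.

```python
import numpy as np

# Hypothetical 2-D data with two obvious groups, and two initial centers
X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
centroids = np.array([[0.0, 1.0], [4.0, 4.0]])

# Step 1: squared distance from every sample to every center, shape (m, K)
dis = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
idx = dis.argmin(axis=1)  # index of the nearest center for each sample

# Step 2: move each center to the mean of the samples assigned to it
new_centroids = np.array([X[idx == k].mean(axis=0) for k in range(len(centroids))])

print(idx)
print(new_centroids)
```

Note that a cluster with no assigned samples would produce a mean of an empty slice; this sketch does not handle that case.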

### 2. Objective function

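• Also known as the distortion cost function: $J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m}\|x^{(i)} - \mu_{c^{(i)}}\|^2$
• Finally we want to get: $\min_{c,\mu} J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K)$
• where $c^{(i)}$ indicates which cluster center the `i`-th sample is closest to,
• and $\mu_k$ is a cluster center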
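As a sketch of how this objective could be evaluated (the helper name `distortion` and the data are assumptions), the cost is the mean squared distance between each sample and the center it is assigned to:

```python
import numpy as np

def distortion(X, idx, centroids):
    """Mean squared distance from each sample to its assigned cluster center."""
    diff = X - centroids[idx]  # centroids[idx] picks the assigned center per row
    return np.mean(np.sum(diff ** 2, axis=1))

# Hypothetical data: every sample is exactly 1 unit away from its center
X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0], [12.0, 0.0]])
idx = np.array([0, 0, 1, 1])
centroids = np.array([[1.0, 0.0], [11.0, 0.0]])

print(distortion(X, idx, centroids))
```

Comparing this value across several random initializations is one way to pick the best run.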

### 3. Selection of cluster centers

• Random initialization: randomly pick K samples from the given data as the cluster centers
• A single random initialization may give a poor result; you can run several random initializations and keep the one that minimizes the cost function
• Implementation code: (a single random initialization here)
```python
# Initialize the cluster centers: pick K random samples as the centers
def kMeansInitCentroids(X, K):
    m = X.shape[0]
    m_arr = np.arange(0, m)    # indices 0 to m-1
    centroids = np.zeros((K, X.shape[1]))
    np.random.shuffle(m_arr)   # shuffle the indices
    rand_indices = m_arr[:K]   # take the first K
    centroids = X[rand_indices, :]
    return centroids
```

### 4. Selection of the number of clusters K

• Clustering has no labels y, so the true number of clusters is unknown
• Elbow method
• Plot the cost function `J` against `K`; if there is an inflection point (an elbow), as shown in the figure below, take the `K` at the elbow; in the figure below `K=3`
• If the curve is smooth and has no clear elbow, `K` has to be chosen by hand.
• The second approach is to choose by human observation

### 5. Application - image compression

• Cluster the pixels of the picture into several classes, and then replace each original pixel value with the center of its class
• Code for running the clustering algorithm:
```python
# Clustering algorithm
def runKMeans(X, initial_centroids, max_iters, plot_process):
    m, n = X.shape                  # number of samples and dimensions
    K = initial_centroids.shape[0]  # number of clusters
    centroids = initial_centroids   # current cluster centers
    previous_centroids = centroids  # previous cluster centers
    idx = np.zeros((m, 1))          # cluster index of each sample

    for i in range(max_iters):      # iterations
        print('Iteration: %d' % (i+1))
        idx = findClosestCentroids(X, centroids)
        if plot_process:    # if plotting is enabled
            plt = plotProcessKMeans(X, centroids, previous_centroids)  # plot the movement of the cluster centers
            previous_centroids = centroids  # reset
        centroids = computerCentroids(X, idx, K)  # recompute the cluster centers
    if plot_process:    # show the final plot
        plt.show()
    return centroids, idx  # cluster centers and the cluster of each sample
```
### 6. Clustering with the KMeans model in the scikit-learn library

• Import package
```python
from sklearn.cluster import KMeans
```
• Fit the model to the data
```python
model = KMeans(n_clusters=3).fit(X)  # n_clusters sets 3 clusters; fit the data
```
• Cluster centers
```python
centroids = model.cluster_centers_  # cluster centers
```

### 7. Running results

• Movement of 2D data class centers
• Image Compression

## 6. PCA Principal Component Analysis (Dimensionality Reduction)

### 1. Use

• Data compression, which makes programs run faster
• Data visualization, e.g. `3D-->2D`

### 2. 2D–>1D, nD–>kD

• As shown in the figure below, all data points are projected onto a straight line, chosen so that the sum of squared projection distances (the projection error) is smallest
• Note that the data needs to be `normalized` first
• The idea is to find `1` `vector u` such that projecting all the data onto it minimizes the projection distance
• Then `nD-->kD` means finding `k` vectors such that projecting all the data onto them minimizes the projection error
• eg: 3D–>2D, 2 vectors represent a plane, and the projection error of all points projected onto this plane is minimal

### 3. The difference between PCA and linear regression

• Linear regression finds the relationship between `x` and `y` and then uses it to predict `y`
• `PCA` finds a projection surface that minimizes the projection error from the data to this surface

### 4. PCA dimensionality reduction process

• Data preprocessing (mean normalization)
• $x_j^{(i)} := \frac{x_j^{(i)} - \mu_j}{s_j}$: subtract the mean of the corresponding feature, then divide by the standard deviation of that feature (the range, maximum minus minimum, can also be used)
• Implementation code:
```python
# Normalize the data
def featureNormalize(X):
    '''(each value - mean of its column) / standard deviation of its column'''
    n = X.shape[1]
    mu = np.zeros((1, n))
    sigma = np.zeros((1, n))

    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    for i in range(n):
        X[:, i] = (X[:, i]-mu[i])/sigma[i]
    return X, mu, sigma
```
• Compute the `covariance matrix Σ`: $\Sigma = \frac{1}{m}\sum_{i=1}^{m}x^{(i)}(x^{(i)})^T$. Note that `Σ` here is not the summation symbol
• The covariance matrix is `symmetric positive semidefinite` (if this is unfamiliar, see a linear algebra text)
• Its size is `nxn`, where `n` is the dimension of the `feature`s
• Implementation code:
```python
Sigma = np.dot(np.transpose(X_norm), X_norm)/m  # compute Sigma
```
• Compute the eigenvalues and eigenvectors of `Σ`
• This can be done using the `svd` singular value decomposition function: `U,S,V = svd(Σ)`
• It returns `S`, a diagonal matrix of the same size as `Σ` (made up of the eigenvalues of `Σ`) [Note: in `matlab` the function returns a diagonal matrix; in `python` a vector is returned to save space]
• There are also two unitary matrices U and V
• Note: the `S` computed by `svd` is sorted in descending order of eigenvalue. If `svd` is not used, `U` needs to be reordered according to the size of the eigenvalues
• Dimensionality reduction
• Select the first `K` columns of `U` (assuming you want to reduce to `K` dimensions)
• `Z` is the data after dimensionality reduction
• Implementation code:
```python
# Project the data
def projectData(X_norm, U, K):
    Z = np.zeros((X_norm.shape[0], K))

    U_reduce = U[:, 0:K]  # take the first K columns
    Z = np.dot(X_norm, U_reduce)
    return Z
```
• Process summary:
• `Sigma = X'*X/m`
• `U,S,V = svd(Sigma)`
• `Ureduce = U[:,0:k]`
• `Z = Ureduce'*x`
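The process summary above can be sketched end to end in NumPy. This is only a minimal illustration, not the author's script: the helper name `pca_reduce` and the synthetic data are assumptions made for the example.

```python
import numpy as np

def pca_reduce(X, K):
    """Hypothetical helper: normalize X, then project it onto its first K principal directions."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_norm = (X - mu) / sigma            # mean normalization and scaling
    m = X_norm.shape[0]
    Sigma = X_norm.T @ X_norm / m        # n x n covariance matrix
    U, S, Vt = np.linalg.svd(Sigma)      # S is a 1-D array, sorted in descending order
    U_reduce = U[:, :K]                  # first K columns of U
    Z = X_norm @ U_reduce                # row-wise form of Z = Ureduce' * x
    return Z, U, S

# Synthetic data for illustration: 200 points lying close to a line in 2-D
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([t, 2.0 * t + 0.1 * rng.normal(size=200)])

Z, U, S = pca_reduce(X, K=1)
print(Z.shape)   # (200, 1)
print(S)         # eigenvalues of Sigma, largest first
```

Because the data is almost one-dimensional, the first entry of `S` should dominate the second, which is the situation in which reducing to `K=1` loses little information.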

### 5. Data recovery

• Implementation code:
```python
# Recover the data
def recoverData(Z, U, K):
    U_reduce = U[:, 0:K]
    X_rec = np.dot(Z, np.transpose(U_reduce))  # map back to n dimensions (an approximation)
    return X_rec
```
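To see in what sense the recovered data is approximate, the following sketch measures the average squared reconstruction error for several values of `K`. The random data is made up, and `recoverData` is restated so that the block runs on its own.

```python
import numpy as np

def recoverData(Z, U, K):
    # Map the K-dimensional data back to the original n-dimensional space
    return Z @ U[:, :K].T

rng = np.random.default_rng(1)
X_norm = rng.normal(size=(50, 3))
X_norm = X_norm - X_norm.mean(axis=0)           # center the made-up data
Sigma = X_norm.T @ X_norm / X_norm.shape[0]
U, S, Vt = np.linalg.svd(Sigma)

errors = []
for K in (1, 2, 3):
    Z = X_norm @ U[:, :K]                        # project onto the first K directions
    X_rec = recoverData(Z, U, K)
    errors.append(np.mean(np.sum((X_norm - X_rec) ** 2, axis=1)))
    print(K, errors[-1])
```

The error shrinks as `K` grows and is zero (up to rounding) when `K` equals the number of features, since nothing is discarded in that case.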

### 6. Selection of the number of principal components (that is, the dimension to be reduced)

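• How to choose
• The error rate is generally taken as `1%`, `5%`, `10%`, etc.
• How to implement
• Recomputing the projection for every candidate `K` would be too expensive
• Instead, reuse the `S` already returned by `svd` and increase `K` step by step until the error rate is small enough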
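The formula that originally accompanied this list did not survive in the text. One common criterion, assumed here rather than confirmed by the source, uses the `S` values from `svd`: pick the smallest `K` whose retained variance `sum(S[:K]) / sum(S)` is at least `1 - error`. The helper name `choose_k` and the eigenvalues below are made up for illustration.

```python
import numpy as np

def choose_k(S, error=0.01):
    """Hypothetical helper: smallest K whose retained variance is at least 1 - error."""
    retained = np.cumsum(S) / np.sum(S)     # retained variance for K = 1, 2, ..., n
    # argmax returns the index of the first True; add 1 to turn the index into a count
    return int(np.argmax(retained >= 1.0 - error) + 1)

S = np.array([5.0, 3.0, 1.5, 0.4, 0.1])     # made-up eigenvalues, in descending order
print(choose_k(S, error=0.10))   # 3
print(choose_k(S, error=0.02))   # 4
```

This needs only one `svd` call; each candidate `K` costs a cumulative sum rather than a new projection.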

### 7. Suggestions for use

• Do not use PCA to address overfitting; use regularization instead (PCA may still be acceptable if a high proportion of the variance is retained)
• Consider PCA only when the model gives good results on the original data but runs very slowly

### 8. Running results

• 2D data reduced to 1D

• the direction to project

• 2D reduced to 1D and the corresponding relationship

• Face Data Dimensionality Reduction

• Raw data

• Visualization of part of the `U` matrix

• Data recovery

### 9. Use PCA in the scikit-learn library to achieve dimensionality reduction

• Import the required packages:
```python
# -*- coding: utf-8 -*-
# Author:bob
# Date:2016.12.22
import numpy as np
from matplotlib import pyplot as plt
from scipy import io as spio
from sklearn.decomposition import PCA   # the old sklearn.decomposition.pca module is gone in recent releases
from sklearn.preprocessing import StandardScaler
```
• Normalize the data
```python
# Normalize the data
scaler = StandardScaler()
scaler.fit(X)
x_train = scaler.transform(X)
```
• Fit the data with the PCA model and reduce the dimensionality
• `n_components` is the number of dimensions to reduce to
```python
# Fit the data
K = 1  # number of dimensions to reduce to
model = PCA(n_components=K).fit(x_train)   # fit the data; n_components sets the target dimension
Z = model.transform(x_train)               # transform performs the dimensionality reduction
```
• Data Recovery
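• `model.components_` gives the `Ureduce` matrix (shape `K x n`, one principal direction per row)
```python
# Recover the data
Ureduce = model.components_      # Ureduce used for the reduction, shape (K, n)
x_rec = np.dot(Z, Ureduce)       # recover the normalized data (approximately)
```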
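As a cross-check of the manual recovery, scikit-learn's `PCA` also offers `inverse_transform`, and `explained_variance_ratio_` reports how much variance each component keeps. The sketch below runs on made-up 2-D data, not the data set used in this article.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Made-up 2-D data lying close to a line
rng = np.random.default_rng(0)
t = rng.normal(size=100)
X = np.column_stack([t, 3.0 * t + 0.2 * rng.normal(size=100)])

x_train = StandardScaler().fit_transform(X)
model = PCA(n_components=1).fit(x_train)
Z = model.transform(x_train)

x_rec_manual = np.dot(Z, model.components_)   # recovery as in the text
x_rec = model.inverse_transform(Z)            # built-in recovery; it also adds back the fitted mean
print(np.allclose(x_rec, x_rec_manual))
print(model.explained_variance_ratio_)        # fraction of variance kept by each component
```

The two recoveries agree here only because the standardized data already has zero mean; on uncentered input, `inverse_transform` is the safer choice.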

## 7. Anomaly Detection

### 1. Gaussian distribution (normal distribution)

• Here `u` is the mean of the data and `σ` is the standard deviation of the data
• The smaller `σ` is, the sharper (taller and narrower) the corresponding curve
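The density formula itself did not survive in the text. For reference, the standard univariate Gaussian density with mean `u` and standard deviation `σ` is `p(x) = 1/(sqrt(2π)·σ) · exp(-(x-u)²/(2σ²))`. A small sketch follows; the function name `gaussian_pdf` is chosen for this example.

```python
import numpy as np

def gaussian_pdf(x, u, sigma):
    """Univariate Gaussian density with mean u and standard deviation sigma."""
    return np.exp(-(x - u) ** 2 / (2.0 * sigma ** 2)) / (np.sqrt(2.0 * np.pi) * sigma)

# The peak height at x = u is 1 / (sqrt(2*pi) * sigma), so a smaller sigma gives a sharper curve
for sigma in (0.5, 1.0, 2.0):
    print(sigma, gaussian_pdf(0.0, 0.0, sigma))
```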

### 2. Anomaly detection algorithm

• Example
• Calculate `p(x)`; if `p(x)<ε` the example is considered anomalous, where `ε` is the probability threshold we choose
• This uses only univariate Gaussian distributions and assumes the features are independent; the multivariate Gaussian distribution discussed below automatically captures the relationships between features
• Parameter Estimation Implementation Code
```python
# Parameter estimation (compute the mean and variance of each feature)
def estimateGaussian(X):
    mu = np.mean(X, axis=0)      # axis=0: mean of each column
    sigma2 = np.var(X, axis=0)   # variance of each column
    return mu, sigma2
```
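With the estimated parameters, `p(x)` under the independence assumption is the product of the per-feature densities. The sketch below puts the pieces together on made-up data; `univariate_p` and the threshold value are illustrative assumptions, not part of the original code.

```python
import numpy as np

def estimateGaussian(X):
    return np.mean(X, axis=0), np.var(X, axis=0)

def univariate_p(X, mu, sigma2):
    """Hypothetical helper: p(x) as the product of per-feature Gaussian densities."""
    dens = np.exp(-(X - mu) ** 2 / (2.0 * sigma2)) / np.sqrt(2.0 * np.pi * sigma2)
    return np.prod(dens, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(loc=[10.0, 5.0], scale=[1.0, 0.5], size=(500, 2))   # made-up "normal" examples
mu, sigma2 = estimateGaussian(X)

x_new = np.array([[10.2, 5.1],    # close to the mean
                  [16.0, 1.0]])   # far from the mean in both features
p = univariate_p(x_new, mu, sigma2)
epsilon = 1e-4                    # illustrative threshold, not tuned
print(p < epsilon)                # True marks an example flagged as anomalous
```

In practice `ε` is not picked by hand like this; it is selected on a labeled cross-validation set, as the next section describes.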

### 3. Evaluating the quality of `p(x)` and selecting `ε`

• Error measures for skewed data
• Because the data may be very skewed (the number of `y=1` examples is very small, where `y=1` indicates an anomaly), use `Precision/Recall` to compute the `F1Score` (on the CV cross-validation set)
• For example, when predicting cancer, suppose the model is correct `99%` of the time, a `1%` error rate, while the actual probability of cancer is only `0.5%`. Then always predicting no cancer (y=0) gives an even smaller error rate, so `error rate` is not a sound way to evaluate the model.
• Record the counts as shown below:
• Taking cancer prediction again, suppose every prediction is no-cancer: TN=199, FN=1, TP=0, FP=0. Then Precision=0/0 (undefined) and Recall=0/1=0; accuracy=199/200=99.5%, but it is not a credible measure.
• Selecting `ε`
• Try multiple values of `ε` and keep the one that gives the highest `F1Score`
• Implementation code
```python
# Select the best epsilon, i.e. the one that maximizes the F1 score
def selectThreshold(yval, pval):
    # Initialize the required variables
    bestEpsilon = 0.
    bestF1 = 0.
    F1 = 0.
    step = (np.max(pval) - np.min(pval)) / 1000
    # Search over candidate thresholds
    for epsilon in np.arange(np.min(pval), np.max(pval), step):
        cvPrediction = pval < epsilon   # True where an example is flagged as anomalous
        tp = np.sum((cvPrediction == 1) & (yval == 1).ravel()).astype(float)  # np.sum returns an int; convert to float
        fp = np.sum((cvPrediction == 1) & (yval == 0).ravel()).astype(float)
        fn = np.sum((cvPrediction == 0) & (yval == 1).ravel()).astype(float)
        if tp == 0:        # precision and recall would be 0 or 0/0, so skip this threshold
            continue
        precision = tp / (tp + fp)   # precision
        recall = tp / (tp + fn)      # recall
        F1 = (2 * precision * recall) / (precision + recall)  # F1 score formula
        if F1 > bestF1:    # keep the best F1 score found so far
            bestF1 = F1
            bestEpsilon = epsilon
    return bestEpsilon, bestF1
```
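The cancer example above can be checked numerically. In this sketch, reporting an undefined ratio (zero denominator) as `0.0` is a convention assumed for illustration; other conventions exist.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); a ratio with a zero denominator is reported as 0.0."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1

# "Always predict no cancer": TN=199, FN=1, TP=0, FP=0
tn, fn, tp, fp = 199, 1, 0, 0
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)                          # 0.995
print(precision_recall_f1(tp, fp, fn))   # (0.0, 0.0, 0.0)
```

The accuracy looks excellent while the F1 score is zero, which is why F1 is the quantity maximized when choosing `ε`.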

### 4. Choosing which features to use (univariate Gaussian distribution)

• If some data does not follow a Gaussian distribution, you can transform it, for example with `log(x+C)`, `x^(1/2)`, etc.
• If `p(x)` is large for both normal and anomalous examples, try combining multiple features into a new one (because the features may be related)
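One way to judge whether a transform helps is to compare the skewness of a feature before and after, since a Gaussian has skewness 0. This is a sketch on made-up right-skewed data; the `skewness` helper and the constant `C` are assumptions for this example.

```python
import numpy as np

def skewness(x):
    """Sample skewness: third standardized moment (0 for a symmetric distribution)."""
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=5000)   # made-up, strongly right-skewed feature

C = 1e-3                                            # small constant so the log is defined at 0
print(skewness(x))                  # raw feature
print(skewness(np.log(x + C)))      # after log(x + C)
print(skewness(np.sqrt(x)))         # after x^(1/2)
```

For this data the log transform removes almost all of the skew, while the square root removes only part of it. Which transform works best depends on the feature, so plotting a histogram remains a useful check.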

### 5. Multivariate Gaussian distribution

• Problems with the univariate Gaussian distribution

• As shown in the figure below, the red points are anomalies and the others are normal points (for example, CPU and memory usage)

• The Gaussian distribution corresponding to x1 is as follows:

• The Gaussian distribution corresponding to x2 is as follows:

• The corresponding p(x1) and p(x2) values do not differ much from those of normal points, so the red points will not be considered anomalous

• Because we assume the features are independent of each other, the contours in the figure above expand as perfect circles

• For example, a covariance matrix `Σ` with positive off-diagonal entries indicates that x1 and x2 are positively correlated: the larger x1 is, the larger x2 is, as shown in the figure below, and the red anomalous points can then be detected.

If the off-diagonal entries are negative, x1 and x2 are negatively correlated.

• Implementation code:

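```python
# Multivariate Gaussian density
def multivariateGaussian(X, mu, Sigma2):
    k = len(mu)
    # If Sigma2 is a vector of variances, turn it into a diagonal covariance matrix;
    # a full k x k covariance matrix is used as is
    if Sigma2.ndim == 1 or 1 in Sigma2.shape:
        Sigma2 = np.diag(np.ravel(Sigma2))
    X = X - mu   # center the data
    argu = (2*np.pi)**(-k/2.0) * np.linalg.det(Sigma2)**(-0.5)
    p = argu * np.exp(-0.5 * np.sum(np.dot(X, np.linalg.inv(Sigma2)) * X, axis=1))  # axis=1: sum over each row
    return p
```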
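The effect of the off-diagonal entries can be seen by scoring the same two points with and without them. The covariance values below are made up, and the function is restated in simplified form so that the block runs on its own.

```python
import numpy as np

def multivariateGaussian(X, mu, Sigma2):
    k = len(mu)
    if Sigma2.ndim == 1:
        Sigma2 = np.diag(Sigma2)                  # vector of variances -> diagonal covariance
    Xc = X - mu
    coef = (2 * np.pi) ** (-k / 2.0) * np.linalg.det(Sigma2) ** (-0.5)
    return coef * np.exp(-0.5 * np.sum(np.dot(Xc, np.linalg.inv(Sigma2)) * Xc, axis=1))

mu = np.array([0.0, 0.0])
Sigma_full = np.array([[1.0, 0.8],
                       [0.8, 1.0]])               # made-up covariance: x1 and x2 positively correlated
sigma2_diag = np.diag(Sigma_full).copy()          # same variances, correlation ignored

pts = np.array([[1.5, 1.5],                       # follows the correlation
                [1.5, -1.5]])                     # breaks the correlation
p_diag = multivariateGaussian(pts, mu, sigma2_diag)
p_full = multivariateGaussian(pts, mu, Sigma_full)
print(p_diag)   # both points get the same density
print(p_full)   # the second point gets a far smaller density
```

With independent features the two points are indistinguishable, whereas the full covariance matrix assigns a much lower probability to the point that violates the correlation, so a threshold `ε` can separate them.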

### 6. Characteristics of the univariate and multivariate Gaussian distributions

• Univariate Gaussian distribution
• Can be used when the relationships between features are captured manually (by hand-crafting combined features)
• Low computational cost
• Multivariate Gaussian distribution
• Automatically captures correlations between features
• Can be used when `m>n` and `Σ` is invertible. (If `Σ` is not invertible, either there are redundant, linearly dependent features x, or m<n)
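Both causes of a non-invertible `Σ` can be reproduced with random data. In this sketch the rank of the covariance matrix is compared with the number of features; the `covariance` helper and the data are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def covariance(X):
    Xc = X - X.mean(axis=0)
    return Xc.T @ Xc / X.shape[0]

# Case 1: fewer examples than features (m < n)
X_small = rng.normal(size=(3, 5))
rank_small = np.linalg.matrix_rank(covariance(X_small))
print(rank_small)   # at most m - 1 = 2, so the 5 x 5 matrix is singular

# Case 2: a redundant feature (x3 = x1 + x2)
A = rng.normal(size=(100, 2))
X_dup = np.column_stack([A, A[:, 0] + A[:, 1]])
rank_dup = np.linalg.matrix_rank(covariance(X_dup))
print(rank_dup)     # 2 rather than 3

# Case 3: m > n with no redundancy
X_ok = rng.normal(size=(100, 3))
rank_ok = np.linalg.matrix_rank(covariance(X_ok))
print(rank_ok)      # 3, so Sigma is invertible
```

When the rank is below the number of features, `np.linalg.inv` fails or returns numerically meaningless values, so in that situation remove redundant features or collect more data before using the multivariate model.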

### 7. Program running results

• Data display
• Contour lines
• Labeled anomalies

Origin blog.csdn.net/kuwola/article/details/122654326