Python machine learning - ADALINE

In the previous post we covered the perceptron. This post mainly records the ADALINE algorithm and some related training details. ADALINE (ADAptive LInear NEuron, Adaline for short) was proposed by Bernard Widrow and his doctoral student Ted Hoff as an improvement on the perceptron algorithm.

Adaline handles the input vector x the same way the perceptron does: a weight vector w is used to form the linear combination z of x, and z is then passed through an activation function and compressed to a binary output (1/-1). The difference is that Adaline updates w using gradient descent.

Since our aim is to classify accurately, we need a way to measure the quality of the classification results, so we introduce an objective function:
\[J(w) = \frac{1}{2}\sum_i \left(y^i - \phi(z^i)\right)^2\]
This can also be called a loss function, and the formula shows why: it computes the squared error between the actual and predicted values over all training samples, i.e. the sum of squared errors (SSE). The factor of 1/2 in front is only there to make the derivative cleaner; it has no other meaning.
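As a small illustration (with made-up numbers, not taken from the post), the SSE cost can be computed directly with NumPy:

import numpy as np

# Hypothetical targets and linear activations phi(z) for three samples.
y = np.array([1, -1, 1])
output = np.array([0.8, -0.5, 0.3])

errors = y - output
cost = (errors ** 2).sum() / 2.0   # J(w) = 1/2 * sum of squared errors
print(cost)                        # 0.39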

With the loss function in hand, our goal becomes concrete: choose w so that the loss function reaches its minimum. The smaller the loss, the fewer samples are misclassified and the better the classifier performs. Because Adaline's loss function is convex, we can use gradient descent to find the weight vector w that minimizes it; you can picture a ball rolling down a hill:

The initial w may give a large loss, but since the loss function J is a function of w and is convex, it has a minimum. Anyone who has studied calculus knows that to find an extremum of a function, the usual approach is to set the derivative to zero and solve. Gradient descent works by differentiation in the same way, but because w is a multidimensional weight vector, we take the partial derivative of the loss function with respect to each component of w and then update the whole vector at once. The derivation is as follows:
(Note: w is a vector and \(w_j\) is a component of w.)
\[
w := w + \Delta w, \qquad \Delta w = -\eta \nabla J(w) \\
\frac{\partial J}{\partial w_j} = \frac{\partial}{\partial w_j}\, \frac{1}{2}\sum_i \left(y^i - \phi(z^i)\right)^2 \\
= \frac{1}{2}\, \frac{\partial}{\partial w_j}\sum_i \left(y^i - \phi(z^i)\right)^2 \\
= \frac{1}{2}\sum_i 2\left(y^i - \phi(z^i)\right)\frac{\partial}{\partial w_j}\left(y^i - \phi(z^i)\right) \\
= \sum_i \left(y^i - \phi(z^i)\right)\frac{\partial}{\partial w_j}\Big(y^i - \sum_j w_j x_j^i\Big) \\
= \sum_i \left(y^i - \phi(z^i)\right)\left(-x_j^i\right) \\
= -\sum_i \left(y^i - \phi(z^i)\right) x_j^i \\
\text{so}\quad \Delta w_j = -\eta\, \frac{\partial J}{\partial w_j} = \eta \sum_i \left(y^i - \phi(z^i)\right) x_j^i
\]
Note that all components of the weight vector w are updated at the same time, and every update uses all of the training samples; for this reason the method is also known as batch gradient descent.
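As a sketch of what this update rule looks like in code (the function name and argument shapes are illustrative, not from the post), one full batch step can be written with NumPy as:

import numpy as np

def batch_update(w, X, y, eta=0.01):
    """One batch gradient descent step for Adaline; w[0] is the bias weight."""
    # Linear activation: phi(z^i) = z^i = w^T x^i + w_0 for every sample.
    output = np.dot(X, w[1:]) + w[0]
    errors = y - output
    # Delta w_j = eta * sum_i (y^i - phi(z^i)) * x_j^i, applied to all j at once.
    w = w.copy()
    w[1:] += eta * X.T.dot(errors)
    w[0] += eta * errors.sum()
    return w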

Next we implement ADALINE concretely. Since it is very similar to the perceptron learning rule, we obtain it directly by modifying the perceptron implementation; the main method that needs to change is fit, because here we use the gradient descent algorithm.

import numpy as np


class AdalineGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ----------
    eta : float
        Learning rate (between 0.0 and 1.0).
    n_iter : int
        Passes over the training dataset.

    Attributes
    ----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Sum-of-squares cost value in every epoch.

    """

    def __init__(self, eta=0.01, n_iter=10):
        self.eta = eta
        self.n_iter = n_iter

    def fit(self, X, y):
        """Fit training data.

        :param X: {array-like}, shape = [n_samples, n_features]
            Training vectors, where n_samples is the number of samples
            and n_features is the number of features.
        :param y: array-like, shape = [n_samples]
            Target values.
        :return: self : object

        """

        self.w_ = np.zeros(1 + X.shape[1]) # Add w_0
        self.cost_ = []

        for i in range(self.n_iter):
            # Linear activation phi(z) = z over the whole training set.
            output = self.net_input(X)
            errors = (y - output)
            # Batch update: w_j += eta * sum_i (y^i - phi(z^i)) * x_j^i
            self.w_[1:] += self.eta * X.T.dot(errors)
            self.w_[0] += self.eta * errors.sum()
            # Record the SSE cost J(w) = 1/2 * sum of squared errors per epoch.
            cost = (errors ** 2).sum() / 2.0
            self.cost_.append(cost)
        return self

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]
    
    def activation(self, X):
        """Computer linear activation"""
        return self.net_input(X)
    
    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)

We now train with two different learning rates (0.01 and 0.0001) and observe the learning process of the neuron. The learning rate and the number of iterations are what we call hyperparameters; we set them by hand, and choosing appropriate hyperparameter values matters a great deal for the whole training process.
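The experiments below assume that X and y come from a two-class subset of the Iris dataset, as in the perceptron post: two features (sepal length and petal length) and labels encoded as -1/1. A minimal sketch of preparing such data (the CSV source and column indices are assumptions) might be:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed data source: the classic Iris CSV from the UCI repository.
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases/iris/iris.data', header=None)

# First 100 rows are setosa and versicolor; encode the labels as -1/1.
y = df.iloc[0:100, 4].values
y = np.where(y == 'Iris-setosa', -1, 1)

# Use sepal length (column 0) and petal length (column 2) as the two features.
X = df.iloc[0:100, [0, 2]].values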

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8,4))
ada1 = AdalineGD(n_iter=10, eta=0.01).fit(X, y)
ax[0].plot(range(1, len(ada1.cost_) + 1), np.log10(ada1.cost_), marker='o')
ax[0].set_xlabel('Epochs')
ax[0].set_ylabel('log(Sum-squared-error)')
ax[0].set_title('Adaline - Learning rate 0.01')
ada2 = AdalineGD(n_iter=10, eta=0.0001).fit(X, y)
ax[1].plot(range(1, len(ada2.cost_) + 1), ada2.cost_, marker='o')
ax[1].set_xlabel('Epochs')
ax[1].set_ylabel('Sum-squared-error')
ax[1].set_title('Adaline - Learning rate 0.0001')
plt.show()

As can be seen, on the left with a learning rate of 0.01 the error grows as the number of iterations increases, which shows that an improperly set learning rate can do real harm. On the right, with a learning rate of 0.0001, the error decreases as the iterations increase, but it decreases too slowly: convergence is very slow and training efficiency is too low. So a learning rate that is either too large or too small is inappropriate.

In other words, if the learning rate is too large, each gradient descent update can jump past the minimum, so the weight vector w keeps overshooting it and the algorithm fails to converge.

Next we describe a data-preprocessing method: scaling the features before training. Here we standardize the features, rescaling each feature so that it has mean 0 and variance 1; this speeds up training and keeps the learned model from being distorted by features on very different scales.

The specific formula is as follows:
\[x_j' = \frac{x_j - \mu_j}{\sigma_j}\]
where \(\mu_j\) and \(\sigma_j\) are the mean and standard deviation of feature j. It is implemented as follows:

X_std = np.copy(X)
X_std[:, 0] = (X[:,0] - X[:,0].mean()) / X[:,0].std()
X_std[:, 1] = (X[:,1] - X[:,1].mean()) / X[:,1].std()
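Equivalently, if scikit-learn is available, the same standardization could be done with its StandardScaler (shown here only as an alternative sketch, not as part of the original code):

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
X_std = sc.fit_transform(X)  # each column now has mean 0 and variance 1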

With data preprocessing finished, we can now train the model and plot its decision regions.
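The plot_decision_region helper used below is not defined in this post; it is assumed to be the decision-boundary plotting function from the previous perceptron post. A minimal sketch under that assumption:

from matplotlib.colors import ListedColormap

def plot_decision_region(X, y, classifier, resolution=0.02):
    # Marker and color setup for up to five classes.
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])

    # Predict over a grid covering the feature ranges to draw the decision surface.
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution),
                           np.arange(x2_min, x2_max, resolution))
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)
    plt.contourf(xx1, xx2, Z, alpha=0.4, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

    # Scatter the samples of each class on top of the decision regions.
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                    alpha=0.8, c=cmap(idx), marker=markers[idx], label=cl)

With that helper in place, the training and plotting code runs as follows: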

ada = AdalineGD(n_iter=15, eta=0.01)
ada.fit(X_std, y)
plot_decision_region(X_std, y, classifier=ada)
plt.title('Adaline - Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc='upper left')
plt.show()

plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Sum-squared-error')
plt.show()

As the figure shows, the error now decreases as the number of iterations increases. Even though the learning rate is still 0.01, the algorithm that failed to converge before standardization eventually converges after standardization.
