Previously, we implemented Adaline using gradient descent. Because that method uses all of the training samples for every update of the weight vector, it is also called batch gradient descent. Now suppose we have a very large dataset, say one million samples. With batch gradient descent, every single weight update requires a pass over all one million samples, so training is slow and inefficient. We would like to keep using gradient descent without having to touch every sample on each update, and this is the motivation for stochastic gradient descent.
Stochastic gradient descent updates the weight vector using only one training sample at a time:
\[\eta \left(y^{(i)} - \phi\left(z^{(i)}\right)\right) x^{(i)}\]
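As a concrete illustration of this update, here is a minimal NumPy sketch of a single stochastic step. The helper name sgd_step and the toy numbers are assumptions made for illustration only; the bias is stored in w[0], the same convention the AdalineSGD class in this post uses, and the activation is the identity as in Adaline.

```python
import numpy as np

def sgd_step(w, xi, target, eta=0.01):
    """Return the weights after one update from a single sample (hypothetical helper)."""
    output = np.dot(xi, w[1:]) + w[0]   # net input z; Adaline's activation is the identity
    error = target - output             # y - phi(z)
    w = w.copy()                        # leave the caller's array untouched
    w[1:] += eta * error * xi           # eta * (y - phi(z)) * x
    w[0] += eta * error                 # bias update (its input is fixed at 1)
    return w

# Toy data: two features, weights start at zero
w = np.zeros(3)
w = sgd_step(w, xi=np.array([1.0, 2.0]), target=1.0, eta=0.1)
print(w)
```

Only one sample is read per update, so the cost of a step does not depend on the size of the training set.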
This method usually converges faster than batch gradient descent because the weight vector is updated much more often. Each update is based on a single sample, so the steps are noisier than those computed from the whole training set, and this randomness can help the algorithm avoid getting stuck in a local minimum. When using this method, the samples must be presented in random order: shuffle the training set before every epoch so that the updates do not follow a repeating cycle. The learning rate also does not have to be fixed during training. It can be decreased as the number of iterations grows, which helps the algorithm converge.
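The two practical points above, reshuffling before every epoch and shrinking the learning rate over time, can be sketched as follows. The schedule c1 / (epoch + c2) is one common choice and is an assumption here, as are the constants c1 and c2, the helper name decayed_eta, and the toy data; the original text does not prescribe a particular schedule.

```python
import numpy as np

def decayed_eta(epoch, c1=1.0, c2=100.0):
    """Learning rate that shrinks as the epoch count grows (illustrative schedule)."""
    return c1 / (epoch + c2)

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))                 # toy feature matrix
y = np.where(X[:, 0] > 0, 1.0, -1.0)        # toy class labels
w = np.zeros(3)                             # w[0] is the bias

etas = []
for epoch in range(5):
    order = rng.permutation(len(y))         # new random sample order every epoch
    eta = decayed_eta(epoch)
    etas.append(eta)
    for xi, target in zip(X[order], y[order]):
        error = target - (np.dot(xi, w[1:]) + w[0])
        w[1:] += eta * error * xi
        w[0] += eta * error

print(etas)
```

With these constants the learning rate starts at 0.01 and decreases slowly, so early epochs take larger steps than later ones.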
So far we have batch gradient descent, which uses all of the samples, and stochastic gradient descent, which uses a single sample. A compromise between the two is mini-batch learning, in which each update of the weight vector uses a subset of the training samples.
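A minimal sketch of mini-batch learning for Adaline might look like the following. The function minibatch_epoch, the batch size of 10, and the synthetic data are all assumptions for illustration; they are not part of the implementation in this post.

```python
import numpy as np

def minibatch_epoch(w, X, y, eta=0.01, batch_size=10, rng=None):
    """Run one epoch of mini-batch updates and return the new weights (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    order = rng.permutation(len(y))              # shuffle once per epoch
    w = w.copy()
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]    # indices of the current mini-batch
        errors = y[idx] - (X[idx].dot(w[1:]) + w[0])
        w[1:] += eta * X[idx].T.dot(errors)      # gradient summed over the mini-batch
        w[0] += eta * errors.sum()
    return w

# Synthetic, linearly separable toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)

w = np.zeros(3)
for _ in range(20):
    w = minibatch_epoch(w, X, y, eta=0.01, batch_size=10, rng=rng)

accuracy = np.mean(np.where(X.dot(w[1:]) + w[0] >= 0.0, 1.0, -1.0) == y)
print(accuracy)
```

Setting batch_size to 1 recovers stochastic gradient descent, and setting it to len(y) recovers batch gradient descent, which is why mini-batch learning sits between the two.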
Next, we implement Adaline using stochastic gradient descent:
import numpy as np
from numpy.random import seed


class AdalineSGD(object):
    """ADAptive LInear NEuron classifier.

    Parameters
    ----------
    eta : float
        Learning rate (between 0.0 and 1.0).
    n_iter : int
        Passes over the training dataset.
    shuffle : bool (default: True)
        Shuffle training data every epoch
        if True to prevent cycles.
    random_state : int (default: None)
        Set random state for shuffling
        and initializing the weights.

    Attributes
    ----------
    w_ : 1d-array
        Weights after fitting.
    cost_ : list
        Average cost over all training samples in every epoch.
    """

    def __init__(self, eta=0.01, n_iter=10, shuffle=True, random_state=None):
        self.eta = eta
        self.n_iter = n_iter
        self.w_initialized = False
        self.shuffle = shuffle
        # Compare against None so that random_state=0 is also honored
        if random_state is not None:
            seed(random_state)

    def fit(self, X, y):
        """Fit training data.

        :param X: {array-like}, shape=[n_samples, n_features]
        :param y: array-like, shape=[n_samples]
        :return:
        self: object
        """
        self._initialize_weights(X.shape[1])
        self.cost_ = []
        for i in range(self.n_iter):
            if self.shuffle:
                X, y = self._shuffle(X, y)
            cost = []
            # Update the weights once per training sample
            for xi, target in zip(X, y):
                cost.append(self._update_weights(xi, target))
            avg_cost = sum(cost) / len(y)
            self.cost_.append(avg_cost)
        return self

    def partial_fit(self, X, y):
        """Fit training data without reinitializing the weights."""
        if not self.w_initialized:
            self._initialize_weights(X.shape[1])
        if y.ravel().shape[0] > 1:
            for xi, target in zip(X, y):
                self._update_weights(xi, target)
        else:
            self._update_weights(X, y)
        return self

    def _shuffle(self, X, y):
        """Shuffle training data"""
        r = np.random.permutation(len(y))
        return X[r], y[r]

    def _initialize_weights(self, m):
        """Initialize weights to zeros"""
        self.w_ = np.zeros(1 + m)
        self.w_initialized = True

    def _update_weights(self, xi, target):
        """Apply Adaline learning rule to update the weights"""
        output = self.net_input(xi)
        error = (target - output)
        self.w_[1:] += self.eta * xi.dot(error)
        self.w_[0] += self.eta * error
        cost = 0.5 * error ** 2
        return cost

    def net_input(self, X):
        """Calculate net input"""
        return np.dot(X, self.w_[1:]) + self.w_[0]

    def activation(self, X):
        """Compute linear activation"""
        return self.net_input(X)

    def predict(self, X):
        """Return class label after unit step"""
        return np.where(self.activation(X) >= 0.0, 1, -1)
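In the _shuffle method, the permutation function from numpy.random returns a random ordering of the integers 0 to len(y) - 1 (0 to 99 for a training set of 100 samples). Using this sequence as an index into both the feature matrix and the class-label vector shuffles the order of the samples while keeping each sample paired with its label.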
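To make the indexing trick easier to see, here is a small standalone sketch. The tiny arrays are made up for illustration, chosen so that row i of X always starts with 2 * i, which makes it simple to check that features and labels stay aligned after shuffling.

```python
import numpy as np

X = np.arange(10).reshape(5, 2)   # rows: [0 1], [2 3], [4 5], [6 7], [8 9]
y = np.array([0, 1, 2, 3, 4])     # label i belongs to row i

r = np.random.permutation(len(y))        # random ordering of the indices 0..4
X_shuffled, y_shuffled = X[r], y[r]      # same index array applied to both

# Each shuffled row still starts with 2 * its label, so the pairs are intact
print(r)
print(np.all(X_shuffled[:, 0] == 2 * y_shuffled))
```

Because the same index array r is applied to X and y, the order changes but no sample is ever separated from its label.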
Now we can start training:
ada = AdalineSGD(n_iter=15, eta=0.01, random_state=1)
ada.fit(X_std, y)
After training, plot the decision regions and the average cost per epoch:
plot_decision_region(X_std, y, classifier=ada)
plt.title('Adaline - Stochastic Gradient Descent')
plt.xlabel('sepal length [standardized]')
plt.ylabel('petal length [standardized]')
plt.legend(loc = 'upper left')
plt.show()
plt.plot(range(1, len(ada.cost_) + 1), ada.cost_, marker='o')
plt.xlabel('Epochs')
plt.ylabel('Average Cost')
plt.show()
As the figures show, the average cost decreases rapidly, and after about 15 epochs the decision boundary is similar to the one obtained with the batch gradient descent version of Adaline.