Active Learning (AL)

These are notes from my own study, with a brief introduction to how I understand active learning.


1. Why use active learning?

Cut costs

In supervised learning, labels are expensive to produce and hard to obtain in large quantities.

Active learning (AL) addresses this problem by choosing which samples to label, with the aim of training a better-performing model from less labeled data.

For example, suppose we have a large collection of unlabeled cat and dog images, and labeling each one by hand is costly. Active learning selects a subset of the images to label, which saves cost. The goal is that a model trained on this selected subset is more accurate than a model trained on the same number of randomly selected labeled samples.

With the same amount of labeled data, a model trained on samples selected by active learning generally performs better than a model trained on randomly selected samples.
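The comparison can be tried on a small scale. The sketch below is only an illustration, assuming scikit-learn is installed: a synthetic two-class dataset stands in for the cat and dog images, and the helper name `run`, the seed set of 10 samples, and the budget of 40 labels are all made up for this example. Which strategy comes out ahead depends on the data and the random seed, so no result is claimed here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an unlabeled pool plus a held-out test set.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=0.3,
                                                  random_state=0)


def run(strategy, n_initial=10, budget=40, seed=0):
    """Label `budget` pool samples with the given strategy.

    Returns (test accuracy, number of labeled samples).
    """
    rng = np.random.default_rng(seed)
    # Seed set: n_initial / 2 samples from each class so the model can be fitted.
    labeled = [int(i) for c in (0, 1)
               for i in rng.choice(np.flatnonzero(y_pool == c),
                                   size=n_initial // 2, replace=False)]
    model = LogisticRegression(max_iter=1000)
    while True:
        model.fit(X_pool[labeled], y_pool[labeled])
        if len(labeled) >= budget:
            break
        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        if strategy == "random":
            pick = rng.choice(unlabeled)
        else:
            # Uncertainty sampling: lowest confidence in the predicted class.
            proba = model.predict_proba(X_pool[unlabeled])
            pick = unlabeled[np.argmax(1.0 - proba.max(axis=1))]
        labeled.append(int(pick))
    return model.score(X_test, y_test), len(labeled)


acc_random, n_random = run("random")
acc_active, n_active = run("uncertainty")
print(f"random: {acc_random:.3f} ({n_random} labels)")
print(f"active: {acc_active:.3f} ({n_active} labels)")
```

Both runs use exactly the same number of labels, so any difference in accuracy comes from which samples were chosen.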

The steps of active learning

The key to active learning is the model you choose, the uncertainty measure you use, and the query strategy you apply to request annotations.

Collect data & select a model

To start, label a small initial batch of data and choose the model to train (for example, a logistic regression model).
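One practical detail of the initial batch is that a classifier cannot be fitted unless every class appears in it. The sketch below assumes a simulated experiment in which the pool labels are already known (with real data you would label a random batch and check the classes afterwards). The function name `initial_split` and the toy labels are hypothetical.

```python
import numpy as np


def initial_split(y, n_per_class, seed=0):
    """Return (labeled_idx, unlabeled_idx) with n_per_class samples of every class."""
    rng = np.random.default_rng(seed)
    labeled = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_per_class, replace=False)
        for c in np.unique(y)
    ])
    unlabeled = np.setdiff1d(np.arange(len(y)), labeled)
    return labeled, unlabeled


y = np.array([0] * 90 + [1] * 10)  # toy, imbalanced labels
labeled_idx, unlabeled_idx = initial_split(y, n_per_class=3)
print(len(labeled_idx), len(unlabeled_idx))  # 6 94
```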

Train the model

Train the model on the labeled data. At this stage the accuracy will usually not be very high.

Check whether the accuracy meets the requirement

If the accuracy meets the requirement (say, 99%), the model is trained well enough and can be put to use.
If the accuracy does not meet the requirement (say, only 12%), the model is not yet well trained. Active learning is then used to select, for manual labeling, the data most useful for improving the model's accuracy.

Define the query strategy

This step covers both the measure of prediction uncertainty and the query strategy used to request labels. The strategy returns the samples it selects for labeling; these are labeled manually, and the process returns to step 2 (training the model).

The query strategy determines which samples are worth labeling, which can greatly reduce the labeling cost.
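The four steps above can be sketched as a single loop. This is a minimal illustration, not a reference implementation: it assumes scikit-learn, a synthetic dataset, a labeled validation set for measuring accuracy, and a made-up labeling budget (`max_labels`). "Manual labeling" is simulated by looking up the known label.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           random_state=1)
X_pool, X_val, y_pool, y_val = train_test_split(X, y, test_size=0.25,
                                                random_state=1)

target_accuracy = 0.99  # assumed requirement checked in step 3
max_labels = 60         # assumed labeling budget

# Step 1: label a small seed set (5 per class) and choose a model.
rng = np.random.default_rng(1)
labeled = [int(i) for c in np.unique(y_pool)
           for i in rng.choice(np.flatnonzero(y_pool == c), size=5, replace=False)]
model = LogisticRegression(max_iter=1000)

history = []
while True:
    # Step 2: train on everything labeled so far.
    model.fit(X_pool[labeled], y_pool[labeled])
    # Step 3: check whether the accuracy meets the requirement.
    accuracy = model.score(X_val, y_val)
    history.append((len(labeled), accuracy))
    if accuracy >= target_accuracy or len(labeled) >= max_labels:
        break
    # Step 4: query the sample the model is least sure about,
    # "label" it, then go back to step 2.
    unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
    proba = model.predict_proba(X_pool[unlabeled])
    query_idx = unlabeled[np.argmax(1.0 - proba.max(axis=1))]
    labeled.append(int(query_idx))

print(history[-1])  # (number of labels used, final validation accuracy)
```

The budget check matters in practice: without it, a target that the model can never reach would keep the loop running until the pool is exhausted.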

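def custom_query_strategy(classifier, X):
  # utility_measure and select_instances are placeholders that you supply yourself
  utility = utility_measure(classifier, X)  # measure the prediction uncertainty (the basis for selection)
  query_idx = select_instances(utility)  # query strategy applied to request labels
  return query_idx, X[query_idx]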

The heart of a query strategy is its utility function.
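To make the idea of a utility function concrete, here is a sketch of three common uncertainty measures for classifiers, written with NumPy only. The function names and the toy probabilities are chosen for this example; modAL ships comparable measures in its `modAL.uncertainty` module, but the code below does not depend on it.

```python
import numpy as np


def least_confidence(proba):
    """1 - probability of the most likely class; larger means less certain."""
    return 1.0 - proba.max(axis=1)


def margin(proba):
    """Gap between the two most likely classes; smaller means less certain."""
    top2 = np.sort(proba, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]


def entropy(proba):
    """Shannon entropy of the predicted distribution; larger means less certain."""
    p = np.clip(proba, 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=1)


def select_instances(utility, n_instances=1):
    """Indices of the n_instances largest utility values."""
    return np.argsort(utility)[::-1][:n_instances]


# Predicted class probabilities for three unlabeled samples.
proba = np.array([[0.90, 0.05, 0.05],
                  [0.40, 0.35, 0.25],
                  [0.50, 0.49, 0.01]])

print(select_instances(least_confidence(proba)))  # [1]
print(select_instances(-margin(proba)))           # [2]
print(select_instances(entropy(proba)))           # [1]
```

The measures do not always agree: sample 2 has the smallest margin between its top two classes, while sample 1 has the flattest distribution overall, so the choice of utility function changes which sample gets labeled.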

Code flow


"""
Active regression example with Gaussian processes.
"""

import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF
from modAL.models import ActiveLearner


# query strategy for regression
def GP_regression_std(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    return np.argmax(std)


# generating the data
X = np.random.choice(np.linspace(0, 20, 10000), size=200, replace=False).reshape(-1, 1)
y = np.sin(X) + np.random.normal(scale=0.3, size=X.shape)

# assembling initial training set
n_initial = 5
initial_idx = np.random.choice(range(len(X)), size=n_initial, replace=False)
X_initial, y_initial = X[initial_idx], y[initial_idx]

# defining the kernel for the Gaussian process
kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
         + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))

# initializing the active learner
regressor = ActiveLearner(
    estimator=GaussianProcessRegressor(kernel=kernel),
    query_strategy=GP_regression_std,
    X_training=X_initial.reshape(-1, 1), y_training=y_initial.reshape(-1, 1)
)

# plotting the initial estimation
# note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-white'
with plt.style.context('seaborn-white'):
    plt.figure(figsize=(14, 7))
    x = np.linspace(0, 20, 1000)
    pred, std = regressor.predict(x.reshape(-1,1), return_std=True)
    plt.plot(x, pred)
    plt.fill_between(x, pred.reshape(-1, )-std, pred.reshape(-1, )+std, alpha=0.2)
    plt.scatter(X, y, c='k')
    plt.title('Initial estimation based on %d points' % n_initial)
    plt.show()

# active learning
n_queries = 10
for idx in range(n_queries):
    query_idx, query_instance = regressor.query(X)
    regressor.teach(X[query_idx].reshape(1, -1), y[query_idx].reshape(1, -1))

# plotting after active learning
# note: Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-white'
with plt.style.context('seaborn-white'):
    plt.figure(figsize=(14, 7))
    x = np.linspace(0, 20, 1000)
    pred, std = regressor.predict(x.reshape(-1,1), return_std=True)
    plt.plot(x, pred)
    plt.fill_between(x, pred.reshape(-1, )-std, pred.reshape(-1, )+std, alpha=0.2)
    plt.scatter(X, y, c='k')
    plt.title('Estimation after %d queries' % n_queries)
    plt.show()

Using active learning, the trend over 1000 evaluation points is predicted from only 15 labeled samples (5 initial samples plus 10 queried ones).
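To see what the query and teach steps amount to, the same loop can be written with scikit-learn alone. This is a sketch of the idea, not modAL's internals: the index list `labeled` and the masking of already-labeled points are additions made for this example, and `y` is kept one-dimensional for simplicity.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)

# Same kind of data as above: a noisy sine curve sampled at 200 points.
X = rng.choice(np.linspace(0, 20, 10000), size=200, replace=False).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=len(X))

# Start with 5 labeled points.
labeled = list(rng.choice(len(X), size=5, replace=False))
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel)

for _ in range(10):
    gp.fit(X[labeled], y[labeled])           # "teach": refit on all labels so far
    _, std = gp.predict(X, return_std=True)  # "query": predictive uncertainty
    std[labeled] = -np.inf                   # never re-query a labeled point
    labeled.append(int(np.argmax(std)))      # label the most uncertain point

# Final fit, evaluated on 1000 grid points.
gp.fit(X[labeled], y[labeled])
x_grid = np.linspace(0, 20, 1000).reshape(-1, 1)
pred, std = gp.predict(x_grid, return_std=True)
print(len(labeled), pred.shape, std.shape)  # 15 (1000,) (1000,)
```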

For the full ipynb notebook, see the author's repository.

Other

Code samples
The modAL repository on GitHub contains many code samples, including the one used in this article.

conda activate pytorch
pip install --user modAL


Origin blog.csdn.net/weixin_62501745/article/details/128795230