Active Learning (Optimal Experimental Design)

1. What is machine learning

Machine learning is an important research field within artificial intelligence and a relatively sophisticated means of information processing. It has important applications in image processing, pattern recognition, data mining, and other fields. A machine learning system should be able to automatically extract information from databases and information systems, convert it into knowledge, and store it in a knowledge base.

In essence, machine learning is about algorithms that give machines the ability to learn. In many cases, these algorithms can generalize from given data, deriving patterns from the data's attributes in order to make predictions on new data that arrives in the future.

2. Background of Active Learning

In the field of machine learning, supervised learning, unsupervised learning, and semi-supervised learning are the three most widely studied and applied types of learning techniques. Briefly:

  • Supervised learning: from the correspondence between inputs and outputs in part of the data, learn a function that maps inputs to appropriate outputs, e.g. classification and regression.

  • Semi-supervised learning: uses labeled data and unlabeled data together to produce a suitable classification function.

  • Unsupervised learning: models the input dataset directly, e.g. clustering.

In fact, much of machine learning addresses the problem of category assignment: given some data, determine which category each item belongs to, or which items belong to the same category, and so on. If we produce a partition (clustering) of the data, organizing it into categories automatically through its inherent attributes and relationships, that is unsupervised learning. If we know from the start which categories the data contains, and some of the data (the training data) already carries class labels, we can generalize from the labeled data to obtain a "data -> category" mapping function and use it to classify the remaining data; that is supervised learning. Semi-supervised learning refers to methods that improve learning accuracy by also exploiting unlabeled data when labeled training data is scarce.

Both supervised learning and semi-supervised learning require a certain amount of labeled data: when training the model, all or part of the data must carry the corresponding labels. With traditional supervised learning methods, the larger the training sample, the better the classification performance tends to be.

However, in many real-life scenarios, labeled samples are hard to obtain: labeling requires domain experts to annotate by hand, which costs a great deal of time and money. Moreover, if the training set is too large, training itself becomes time-consuming.

  • In industrial image annotation, although ImageNet exists as an image database for academia and industry, in many specialized business scenarios practitioners still have to find ways to obtain their own annotated data.

  • In security risk control, malicious ("black") users are few compared with normal users, so how to build a model from very few malicious users is a question worth considering.

  • In business operations and maintenance, the failure time of servers and apps is short compared with normal operation time, so imbalanced samples are inevitable.

So how to obtain more valuable labeled data at lower cost, and thereby further improve the algorithm, is a question worth thinking about.

Active learning offers us this possibility. Active learning uses an algorithm to query the most useful unlabeled samples, hands them to experts for labeling, and then trains the classification model on the queried samples to improve the model's accuracy.

Without active learning, a system generally selects samples at random, or by some hand-crafted rules, to provide for manual labeling. Although this can also bring some improvement, the labeling cost remains high. (As an analogy: a high-school student hopes to raise their exam scores by doing mock college-entrance-exam questions. One option is to randomly pick a batch of questions from past exams and mock papers to work through; this takes a long time and is not well targeted. Another option is to keep a personal notebook of wrong answers, repeatedly revisiting the questions one tends to get wrong and consolidating the knowledge points one is prone to miss, gradually improving one's scores. The idea of active learning is analogous: select a batch of samples that are easy to misclassify, have humans label them, and then train the machine learning model on them.)

In human learning, we usually use existing experience to learn new knowledge, and rely on acquired knowledge to summarize and accumulate experience; experience and knowledge interact continuously. Similarly, machine learning simulates this process: it uses existing knowledge to train a model and acquire new knowledge, and corrects the model with the accumulated information to obtain a more accurate and useful new model. Unlike passive learning, which accepts knowledge passively, active learning can acquire knowledge selectively.

3. What is active learning?

As noted above, in real data analysis scenarios we can obtain massive amounts of data, but the data is unlabeled, so many classic classification algorithms cannot be applied directly. One might say: if the data is unlabeled, just label it! That thinking is natural and simple, but labeling is very expensive; even labeling only thousands or tens of thousands of training examples carries enormous time and money costs.

Before introducing the concept of active learning, let us first discuss sample information.

Simply put, sample information means that in a training dataset, each sample contributes different information to model training; that is, each sample's contribution to the trained model differs from the others.

Therefore, to reduce the size of the training set and the labeling cost as much as possible, the field of machine learning proposes active learning as a way to optimize the classification model.

Why is active learning useful? Let's get an intuition through an example.

[Figure: logistic regression trained via random sampling vs. uncertainty sampling; see the caption below]

(a) A dataset of 400 instances drawn uniformly from two Gaussian-distributed classes, shown as points in a two-dimensional feature space. (b) A logistic regression model trained on 30 labeled instances sampled at random from the problem domain; the blue line is the classifier's decision boundary (70% accuracy). (c) A logistic regression model trained on 30 instances actively queried with uncertainty sampling (90% accuracy).

This shows that different samples contribute differently to the model, so selecting the more valuable samples has practical significance. Of course, how to define and evaluate the value of a sample is itself a focus of active learning research.

So what is the overall idea of Active Learning?

The modeling process of machine learning usually includes several steps: sample selection, model training, model prediction, and model update. Active learning adds two steps to this overall process: extraction of the annotation candidate set, and manual annotation. An active learning system is usually modeled as the tuple

$$
A = (C, Q, S, L, U)
$$

  • L is the set of labeled samples used for training
  • C is a classifier, or a group of classifiers
  • Q is a query function, used to select the most informative samples from the unlabeled pool U
  • U is the pool of unlabeled samples
  • S is the supervisor, who can correctly label the samples in U.

The learner starts with a small set of initially labeled samples L, selects one or a batch of the most informative samples via the query function Q, asks the supervisor S for their labels, then uses the newly acquired knowledge to retrain the classifier and proceeds to the next round of queries. Active learning is thus a cyclic process that continues until some stopping criterion is reached.
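The loop just described can be sketched in a few lines. The sketch below is illustrative, not a canonical implementation: it assumes NumPy and scikit-learn are available, uses a least confident rule as the query function Q, and lets held-back ground-truth labels stand in for the human supervisor S.

```python
# Minimal pool-based active learning loop (illustrative names and data).
import numpy as np
from sklearn.linear_model import LogisticRegression

def least_confident_query(clf, X_pool, k):
    """Q: pick the k pool samples whose top-class probability is lowest."""
    conf = clf.predict_proba(X_pool).max(axis=1)
    return np.argsort(conf)[:k]

def active_learning_loop(clf, X_lab, y_lab, X_pool, y_pool, n_rounds=5, k=5):
    for _ in range(n_rounds):
        clf.fit(X_lab, y_lab)                        # train C on L
        idx = least_confident_query(clf, X_pool, k)  # Q selects from U
        X_lab = np.vstack([X_lab, X_pool[idx]])      # S labels the queries
        y_lab = np.concatenate([y_lab, y_pool[idx]])
        keep = np.setdiff1d(np.arange(len(X_pool)), idx)
        X_pool, y_pool = X_pool[keep], y_pool[keep]  # remove them from U
    return clf.fit(X_lab, y_lab)

# Toy data: two well-separated Gaussian blobs, 200 points per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(2, 1, (200, 2))])
y = np.array([0] * 200 + [1] * 200)
init = np.r_[0:5, 200:205]                 # 10 initially labeled samples
pool = np.setdiff1d(np.arange(400), init)
clf = active_learning_loop(LogisticRegression(), X[init], y[init],
                           X[pool], y[pool])
print(round(clf.score(X, y), 2))
```

After five query rounds the classifier has seen only 35 labeled points, yet on well-separated data its accuracy is already close to that of training on the full set.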

Active learning is a subfield of machine learning, also called query learning, or optimal experimental design in statistics. It aims to achieve the target performance with as few labeled samples as possible.

Sample selection algorithms

Depending on how unlabeled examples are obtained, active learning can be divided into two types: stream-based and pool-based.

In stream-based active learning, unlabeled samples are submitted to the selection engine one by one, in sequence; the engine decides whether to label the currently submitted sample, and discards it if not.

In pool-based active learning, a collection of unlabeled samples is maintained, and the selection engine picks the next samples to label from that collection.


The query strategy (Query Strategy Framework) is the core of active learning. Commonly used query strategies include:

  1. Uncertainty sampling (Uncertainty Sampling);
  2. Committee-based queries (Query-By-Committee);
  3. Queries based on expected model change (Expected Model Change);
  4. Queries based on expected error reduction (Expected Error Reduction);
  5. Queries based on variance reduction (Variance Reduction);
  6. Density-weighted queries (Density-Weighted Methods).

Uncertainty Sampling

As the name implies, uncertainty sampling queries the samples the model has the most trouble distinguishing and provides them to business experts or annotators for labeling, so as to improve the algorithm's performance as quickly as possible. The key to uncertainty sampling is how to characterize the uncertainty of a sample. The usual approaches are:

  1. Least confident (Least Confident);
  2. Margin sampling (Margin Sampling);
  3. Entropy (Entropy).

Least Confident

Binary and multi-class models can usually assign each data point a score indicating which class it most resembles. For example, in a binary classification scenario, suppose a classifier predicts class probabilities (0.9, 0.1) for one data point and (0.51, 0.49) for another. The first point belongs to the first class with probability 0.9, while the second does so with probability only 0.51, so the second point is clearly harder to distinguish and therefore more valuable to label. The least confident method selects for labeling the samples whose most probable class has the smallest predicted probability. As a formula:
$$
x_{LC}^{*} = \operatorname{argmax}_{x}\left(1 - P_{\theta}(\hat{y} \mid x)\right) = \operatorname{argmin}_{x} P_{\theta}(\hat{y} \mid x),
$$

where $\hat{y} = \operatorname{argmax}_{y} P_{\theta}(y \mid x)$ and $\theta$ denotes the parameters of the trained machine learning model; $\hat{y}$ is the class the model predicts with the highest probability for $x$. The least confident method thus picks the samples for which even the most probable predicted class has low confidence.
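The two predictions from the text above make a direct worked example; a minimal sketch, assuming NumPy is available:

```python
import numpy as np

# Predicted class probabilities for the two data points from the text.
proba = np.array([[0.90, 0.10],
                  [0.51, 0.49]])
top = proba.max(axis=1)       # P_theta(y_hat | x) for each sample
lc_score = 1.0 - top          # least confident score: higher = more uncertain
query = int(np.argmax(lc_score))
print(query)                  # -> 1: the (0.51, 0.49) sample is queried
```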

Margin Sampling

Margin sampling selects the samples that could easily be judged as belonging to either of two classes, i.e. those whose predicted probabilities for the two classes differ little. Concretely, it selects the sample with the smallest difference between the model's largest and second-largest predicted probabilities. As a formula:
$$
x_{M}^{*} = \operatorname{argmin}_{x}\left(P_{\theta}\left(\hat{y}_{1} \mid x\right) - P_{\theta}\left(\hat{y}_{2} \mid x\right)\right)
$$

where $\hat{y}_{1}$ and $\hat{y}_{2}$ are, respectively, the most likely and the second most likely class predicted by the model for $x$.
In particular, for binary classification, least confident and margin sampling are actually equivalent.
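Margin sampling can be computed the same way; the probabilities below are hypothetical three-class predictions chosen for illustration:

```python
import numpy as np

# Hypothetical predicted probabilities over three classes for two samples.
proba = np.array([[0.70, 0.20, 0.10],
                  [0.40, 0.38, 0.22]])
sorted_p = np.sort(proba, axis=1)[:, ::-1]   # sort each row descending
margin = sorted_p[:, 0] - sorted_p[:, 1]     # P(y1 | x) - P(y2 | x)
query = int(np.argmin(margin))
print(query)  # -> 1: its top two probabilities (0.40, 0.38) are closest
```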

Entropy

In information theory, entropy measures the uncertainty of a system: the larger the entropy, the more uncertain the system; the smaller the entropy, the less uncertain. Therefore, in binary or multi-class classification, the samples with relatively large predictive entropy can be selected as candidates for labeling. As a formula:
$$
x_{H}^{*} = \operatorname{argmax}_{x}\left(-\sum_{i} P_{\theta}\left(y_{i} \mid x\right) \cdot \ln P_{\theta}\left(y_{i} \mid x\right)\right),
$$

Compared with least confident and margin sampling, the entropy method considers the model's predictions over all classes of $x$: least confident uses only the largest probability, and margin sampling uses only the largest two.
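The entropy criterion can likewise be evaluated in one line; the two hypothetical predictions below contrast a confident prediction with a near-uniform one:

```python
import numpy as np

# Hypothetical predictions: one confident, one close to uniform.
proba = np.array([[0.90, 0.05, 0.05],
                  [0.34, 0.33, 0.33]])
entropy = -np.sum(proba * np.log(proba), axis=1)  # per-sample entropy
query = int(np.argmax(entropy))
print(query)  # -> 1: the near-uniform prediction has the larger entropy
```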

Applications

Application fields include:

  1. Personalized spam, SMS, and content classification: including marketing SMS, subscription mail, spam messages and mail, etc.;

  2. Anomaly detection: including but not limited to security-data anomaly detection, illegal account identification, time-series anomaly detection, etc.;

  3. Many other fields, including image recognition, natural language processing, and security risk control.

Summary

In active learning (Active Learning), the key is how to select a suitable candidate set for manual labeling; the selection method is the so-called query strategy (Query Strategy). A query strategy can be based on a single machine learning model or on several, to be decided according to the actual situation. Overall, active learning exists to reduce labeling cost and improve model performance quickly.


Origin blog.csdn.net/weixin_48266700/article/details/129369637