PU Learning: semi-supervised classification of unlabeled data

How do you classify unlabeled data when you only have a few positive samples?

Suppose you have a dataset of business transactions. Some transactions are labeled as fraudulent, the rest are labeled as genuine, and you need to design a model that distinguishes fraudulent transactions from genuine ones. Assuming you have enough data and good features, this looks like a straightforward classification task. But now suppose only 15% of the dataset is labeled, and the labeled samples belong to only one class: 15% of the training samples are labeled as genuine, while all the remaining samples are unlabeled and could be either genuine or fraudulent. How would you classify them? Does this labeling problem turn the task into an unsupervised learning problem? Well, not necessarily.

This problem is often referred to as PU (positive and unlabeled) classification. It is related to, but distinct from, two common "label problems" that complicate many classification tasks. The first and most common is the problem of a small labeled training set: you have plenty of data, but only a small portion of it is actually labeled. This problem has many variants and a number of dedicated training methods. The second common label problem (often confused with the PU problem) is a training set that is fully labeled but contains only one class. For example, suppose we have a dataset consisting only of non-fraudulent transactions, and we need to use it to train a model that distinguishes non-fraudulent transactions from fraudulent ones. This is also a common problem, usually treated as an unsupervised outlier-detection problem, and the machine learning field offers many tools designed to handle it (OneClassSVM is perhaps the best known).

In contrast, PU classification involves a training set in which only part of the data is labeled as positive, while the remaining data is unlabeled and could be either positive or negative. For example, suppose you work at a bank and can access a large number of transactions, but can only confirm that some of them are 100% genuine. The example I will use involves counterfeit banknotes: a dataset of about 1,200 banknotes, most of which are unlabeled, with only a portion confirmed as genuine. Although PU problems are also quite common, they are discussed far less often than the two classification problems mentioned above, and few practical examples or libraries are available.

The purpose of this article is to present a feasible approach to PU problems, which I recently used in a classification project. It is based on the paper "Learning classifiers from only positive and unlabeled data" by Charles Elkan and Keith Noto (2008), as well as some code written by Alexandre Drouin. Although more PU learning methods appear in the scientific literature (I intend to discuss another popular method in a future article), the Elkan and Noto (E&N) method is very simple and easy to implement in Python.

A little bit of theory

P(y = 1 | x) = P(s = 1 | x) / P(s = 1 | y = 1)
That is essentially the E&N method: given a dataset of positive and unlabeled samples, the probability that a sample is positive [P(y = 1 | x)] equals the probability that the sample is labeled [P(s = 1 | x)] divided by the probability that a positive sample is labeled [P(s = 1 | y = 1)].
If this claim is true (and I am not going to prove or disprove it here; you can read the paper for the proof and the validation code), it is relatively easy to implement. The reason is that although we do not have enough labeled data to train a classifier to tell us whether a sample is positive or negative, in the PU setting we do have enough data to tell us how likely a positive sample is to be labeled. According to the E&N method, that is sufficient to estimate whether a sample is positive.
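For readers who want the missing step, the reasoning behind this claim (Lemma 1 in the paper) is short. Since only positive samples are ever labeled, s = 1 implies y = 1, so:

P(s = 1 | x) = P(y = 1, s = 1 | x) = P(y = 1 | x) * P(s = 1 | y = 1, x)

Under the paper's "selected completely at random" assumption, labeling does not depend on x among the positives, i.e. P(s = 1 | y = 1, x) = P(s = 1 | y = 1). Dividing both sides by that constant gives the formula above.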
More formally, we are given a set of unlabeled data together with a set of samples labeled as positive. Fortunately, if we can estimate P(s=1|x) / P(s=1|y=1), then we can use any sklearn-based estimator according to the following steps:
(1) Fit your classifier on the dataset containing both labeled and unlabeled samples, using the "is labeled" indicator as the target y. A classifier fitted this way is trained to predict the probability that a given sample x is labeled, P(s=1|x).
(2) Use the classifier to predict the probability that each of the known positive samples is labeled, then take the mean of these predicted probabilities to obtain an estimate of P(s=1|y=1).
Having estimated P(s=1|y=1), in order to predict the probability that a data point k is positive according to the E&N method, all we need is an estimate of P(s=1|k), the probability that k is labeled, which is exactly what the classifier from step (1) was trained to do. So:
(3) Use the classifier trained in step (1) to estimate the probability that k is labeled, P(s=1|k).
(4) Once we have estimated P(s=1|k), we can classify k by dividing it by the P(s=1|y=1) estimated in step (2), obtaining the actual probability that it belongs to the positive class.

Now let's write the code and test it

Steps (1)-(4) above can be implemented as follows:

# prepare data
x_data = the training set
y_data = target var (1 for the positives and -1 for the rest)
# fit the classifier and estimate P(s=1|y=1)
classifier, ps1y1 = fit_PU_estimator(x_data, y_data, 0.2, Estimator())
# estimate the probability that x_data is labeled: P(s=1|X)
predicted_s = classifier.predict_proba(x_data)[:, 1]
# estimate the actual probability that X is positive
# by calculating P(s=1|X) / P(s=1|y=1)
predicted_y = predicted_s / ps1y1

Let's start with the main method here: fit_PU_estimator().
The fit_PU_estimator() method completes two main tasks: it fits a classifier of your choice on the dataset of positive and unlabeled samples, and then estimates the probability that a positive sample is labeled. Accordingly, it returns the fitted classifier (which has learned to estimate the probability that a given sample is labeled) and the estimated probability P(s=1|y=1). After that, all we have to do is find P(s=1|x), the probability that x is labeled. Since that is exactly what the classifier was trained for, we just call its predict_proba() method. Finally, to actually classify a sample x, we only need to divide the result by P(s=1|y=1).
Expressed in code:

pu_estimator, probs1y1 = fit_PU_estimator(
  x_train,
  y_train,
  0.2,
  xgb.XGBClassifier())

predicted_s = pu_estimator.predict_proba(x_train)
predicted_s = predicted_s[:,1]
predicted_y = predicted_s / probs1y1
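Note that predicted_y now holds the estimated P(y=1|x) for each sample, not a hard label. A minimal follow-up sketch to turn it into labels, where the 0.5 cut-off is my own arbitrary choice rather than a value from the article:

predicted_y = predicted_y.clip(max=1.0)   # estimation noise can push the ratio slightly above 1
predicted_labels = (predicted_y >= 0.5).astype(int)   # 1 = positive, 0 = negative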

The implementation of fit_PU_estimator() itself is quite simple:

import numpy as np

def fit_PU_estimator(X, y, hold_out_ratio, estimator):
    # The training set will be divided into a fitting-set that will be used
    # to fit the estimator in order to estimate P(s=1|X), and a held-out set
    # of positive samples that will be used to estimate P(s=1|y=1)
    # --------
    # find the indices of the positive/labeled elements
    assert (type(y) == np.ndarray), "Must pass np.ndarray rather than list as y"
    positives = np.where(y == 1.)[0] 
    # hold_out_size = the *number* of positives/labeled samples
    # that we will use later to estimate P(s=1|y=1)
    hold_out_size = int(np.ceil(len(positives) * hold_out_ratio))
    np.random.shuffle(positives)
    # hold_out = the *indices* of the positive elements
    # that we will later use to estimate P(s=1|y=1)
    hold_out = positives[:hold_out_size] 
    # the actual positive *elements* that we will keep aside
    X_hold_out = X[hold_out] 
    # remove the held out elements from X and y
    X = np.delete(X, hold_out,0) 
    y = np.delete(y, hold_out)
    # We fit the estimator on the unlabeled samples + (part of the) positive and labeled ones,
    # in order to estimate P(s=1|X), i.e. the probability that an element is *labeled*
    estimator.fit(X, y)
    # We then use the estimator for prediction on the positive held-out set
    # in order to estimate P(s=1|y=1)
    hold_out_predictions = estimator.predict_proba(X_hold_out)
    # take the probability that it is 1
    hold_out_predictions = hold_out_predictions[:,1]
    # save the mean probability 
    c = np.mean(hold_out_predictions)
    return estimator, c

def predict_PU_prob(X, estimator, prob_s1y1):
    # estimate P(y=1|X) by dividing P(s=1|X) by the estimate of P(s=1|y=1)
    prob_pred = estimator.predict_proba(X)
    prob_pred = prob_pred[:,1]
    return prob_pred / prob_s1y1
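Before running this on real data, it can help to sanity-check both functions on synthetic data where the hidden ground truth is known. The sketch below is my own illustration, not part of the original article: the generated dataset, the -1/1 PU encoding, the 25% labeling rate, the LogisticRegression base estimator and the 0.5 cut-off are all assumptions made for the example.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# build a fully labeled binary problem, then hide most of the positive labels
X, y_true = make_classification(n_samples=1400, n_features=4, n_informative=3,
                                n_redundant=0, random_state=42)
rng = np.random.RandomState(42)
y_pu = np.full(len(y_true), -1.)    # -1 = unlabeled
pos_idx = np.where(y_true == 1)[0]
keep = rng.choice(pos_idx, size=len(pos_idx) // 4, replace=False)
y_pu[keep] = 1.                     # only ~25% of the positives stay labeled

pu_estimator, probs1y1 = fit_PU_estimator(X, y_pu, 0.2, LogisticRegression())
predicted_y = predict_PU_prob(X, pu_estimator, probs1y1)
accuracy = np.mean((predicted_y >= 0.5) == y_true)
print("estimated P(s=1|y=1): %.3f, accuracy vs. hidden truth: %.3f"
      % (probs1y1, accuracy))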

To test this, I used the banknote dataset, which is based on four features extracted from images of genuine and forged banknotes. I first ran the classifier on the fully labeled dataset to establish a baseline, and then removed the labels of 75% of the samples to test how it performs on a PU dataset. As the output shows, this dataset is admittedly not the hardest to classify, but you can see that although the PU classifier knew only about 153 positive samples, while the remaining 1,219 were unlabeled, its performance compared very well to that of the fully labeled classifier. It did, however, lose about 17% in recall, and therefore missed quite a few positive samples. Still, compared with the alternatives, I believe these results are quite satisfactory.

===>> load data set <<===
data size: (1372, 5)
Target variable (fraud or not):
0    762
1    610

===>> create baseline classification results <<===
Classification results:
f1: 99.57%
roc: 99.57%
recall: 99.15%
precision: 100.00%

===>> classify on all the data set <<===
Target variable (labeled or not):
-1    1219
 1     153
Classification results:
f1: 90.24%
roc: 91.11%
recall: 82.62%

A few final points

First, this method heavily depends on the size of the dataset. In this example I used about 150 positive samples and about 1,200 unlabeled ones, which is more than enough for the method to work; with only 100 samples, for example, the classifier would perform very poorly. Next, as noted in the comments, several variables need to be tuned (e.g., the size of the hold-out set, the classifier's probability threshold, etc.), but the most important is probably the choice of classifier and its parameters. I chose XGBoost because it performs relatively well on small datasets with few features, but note that it does not perform best in every case, and testing to find the right classifier is very important.
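To illustrate that last point, here is one way such a comparison could look, reusing the synthetic X, y_pu and y_true arrays from the sketch above (the candidate models and the 0.5 cut-off are again my own choices for the example, not the article's):

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# refit the PU estimator with each candidate base model and compare recall
for estimator in (LogisticRegression(), RandomForestClassifier(n_estimators=100)):
    clf, c = fit_PU_estimator(X, y_pu, 0.2, estimator)
    hard_preds = (predict_PU_prob(X, clf, c) >= 0.5).astype(int)
    print("%s recall: %.3f" % (type(estimator).__name__,
                               recall_score(y_true, hard_preds)))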

Author: Alon Agmon
Translation: Deephub translation group (gkkkkkk)



Origin: blog.csdn.net/m0_46510245/article/details/105166303