Handling imbalanced data

What is imbalanced data

 

Imbalanced data takes a very simple form. Suppose there are apples and pears, and the data in your hands tells you that almost everyone in the world eats only pears. If you stop a random passerby and are asked to guess whether they eat apples or pears, any reasonable person would guess pears.

 

Predicting on imbalanced data is deceptively easy: always guessing the majority class is a safe bet, especially when that class, say red, makes up 90% of the data. By guessing red every time, you already reach a prediction accuracy of 90%. The machine knows this little trick too, so by the end of training it has learned to play it safe and predicts the majority class every time. There are several ways to address this, which we discuss next.
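
The trick described above can be checked with a few lines of Python. This is a minimal sketch: the 90/10 split of pears and apples is a made-up toy data set, and `majority_baseline_accuracy` is a hypothetical helper name, not part of any library.

```python
from collections import Counter

# Hypothetical toy labels: 90 pears and 10 apples (a 90/10 split).
labels = ["pear"] * 90 + ["apple"] * 10

def majority_baseline_accuracy(y):
    """Accuracy of a 'model' that always predicts the most common label."""
    majority_label, majority_count = Counter(y).most_common(1)[0]
    return majority_label, majority_count / len(y)

label, acc = majority_baseline_accuracy(labels)
print(label, acc)  # pear 0.9
```

Any real model trained on such data should be compared against this baseline: an accuracy of 90% here means the model may have learned nothing about apples.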

Get more data

 

First of all, ask whether you can still collect more data. Sometimes the data from an earlier period mostly shows one trend, while the later period shows a different one. If the data from the later period has not been collected, the overall prediction may be less accurate.

Use a different evaluation metric

 

Usually we use accuracy or cost (error) to judge the results of machine learning. On imbalanced data, however, high accuracy and low error become much less meaningful, so we need a different way to evaluate. Use a confusion matrix to compute precision and recall, then compute the F1 score from precision and recall. This approach captures how well the model really separates the classes in imbalanced data and gives a more informative score.
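
As a rough illustration of these metrics, the sketch below counts the four confusion-matrix cells by hand and derives precision, recall and F1 from them. The function names and the toy labels are assumptions made for this example, and treating an undefined ratio (division by zero) as 0.0 is a convention chosen here, not the only possible one.

```python
def binary_counts(y_true, y_pred, positive):
    """Return (tp, fp, fn, tn), the cells of a 2x2 confusion matrix."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive):
    """Return (accuracy, precision, recall, f1) for the chosen positive class."""
    tp, fp, fn, tn = binary_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    # Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

# Hypothetical toy data: 90 pears, 10 apples, and a model that always says "pear".
y_true = ["pear"] * 90 + ["apple"] * 10
always_pear = ["pear"] * 100
result = precision_recall_f1(y_true, always_pear, positive="apple")
print(result)  # (0.9, 0.0, 0.0, 0.0)
```

With the minority class (apple) as the positive class, the always-pear model keeps its 90% accuracy but scores zero on precision, recall and F1, which is exactly the failure that accuracy alone hides.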

 

Resample the data

 

The third method is one of the simplest and most brute-force: recombine the imbalanced data so that it becomes balanced.

Option one: copy or synthesize samples of the minority class until it is about as large as the majority class.

Option two: discard some samples of the majority class until the two classes are about the same size.
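
Both options can be sketched with the standard library alone. The code below is a minimal illustration of random oversampling (copying minority samples) and random undersampling (dropping majority samples); the function names, the fixed seed and the toy data are assumptions for this example, and synthesizing new samples (for instance with SMOTE) is not shown.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Copy random samples of the smaller classes until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        for _ in range(target - count):
            out_samples.append(rng.choice(pool))  # duplicate an existing sample
            out_labels.append(label)
    return out_samples, out_labels

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = min(counts.values())
    out_samples, out_labels = [], []
    for label in counts:
        pool = [s for s, l in zip(samples, labels) if l == label]
        for s in rng.sample(pool, target):  # draw without replacement
            out_samples.append(s)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical toy data: sample ids 0..99, the first 90 are pears.
samples = list(range(100))
labels = ["pear"] * 90 + ["apple"] * 10

over_samples, over_labels = random_oversample(samples, labels)
under_samples, under_labels = random_undersample(samples, labels)
print(dict(Counter(over_labels)), dict(Counter(under_labels)))
```

After oversampling both classes have 90 samples; after undersampling both have 10. Oversampling repeats minority samples, so the model may overfit them, while undersampling throws majority data away. In practice only the training split is usually resampled, so that evaluation still reflects the real class ratio.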

Other machine learning methods

 

Some machine learning methods, such as neural networks, often struggle when faced with imbalanced data.

Other methods, such as decision trees, tend to be much less affected by imbalanced data.

Modify the algorithm

 

The last method is to get creative and try modifying the algorithm. If you use a sigmoid activation function, its output is compared against a prediction threshold: an output that falls on one side of the threshold is predicted as pear, and an output that falls on the other side is predicted as apple. Because pears are now the majority, we adjust the position of the threshold, shifting it toward the apple side so that the model predicts apple only when it is very confident. This lets the machine learn to produce better results.
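
The effect of moving the threshold can be seen in a small sketch. It assumes the sigmoid output is read as a score for "apple"; the raw model outputs and the threshold values 0.5 and 0.8 are made up for illustration, and `predict` is a hypothetical helper rather than any library's API.

```python
import math

def sigmoid(z):
    """Squash a raw model output into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict(z, threshold=0.5):
    """Predict 'apple' only when the sigmoid score reaches the threshold."""
    return "apple" if sigmoid(z) >= threshold else "pear"

# Hypothetical raw outputs of a model for five samples.
raw_outputs = [-2.0, -0.5, 0.3, 1.0, 3.0]

for threshold in (0.5, 0.8):
    print(threshold, [predict(z, threshold) for z in raw_outputs])
# 0.5 ['pear', 'pear', 'apple', 'apple', 'apple']
# 0.8 ['pear', 'pear', 'pear', 'pear', 'apple']
```

Raising the threshold to 0.8 means apple is predicted only for the most confident sample. Lowering it would have the opposite effect and make apple predictions more frequent. Which direction and how far to move it depends on which kind of mistake costs more, so the value is usually tuned on held-out data rather than fixed in advance.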


Source: www.cnblogs.com/Lazycat1206/p/11911598.html