Practical tips | How to deal with imbalanced data?

How to deal with imbalanced data?

Today let's talk about what to do when we encounter imbalanced data in machine learning.

Imbalanced data usually follows a fairly fixed pattern and is easy to recognize. For example, suppose you have data about apples and pears, and that data tells you that people all over the world eat pears. If you now stop a passerby and ask whether he prefers pears or apples, most of us would guess that this person eats pears. The pears have proudly become the dominant, majority class in the data.

This brings us to today's question: how should we deal with imbalanced data?

In fact, prediction on imbalanced data is very easy to understand: the model will always tend to predict the class with more data. That is not even wrong. In particular, if one class makes up 90% of the data and the other only 10%, then simply predicting the larger class every time already reaches 90% accuracy.
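To make this concrete, here is a minimal sketch (my own illustration, not from the original article) of such a lazy baseline on a made-up 90/10 data set, using scikit-learn's DummyClassifier:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy labels: 90% "pear" (0) and 10% "apple" (1) -- an imbalanced data set.
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # the features do not matter for this lazy baseline

# A baseline that always predicts the most frequent class ("pear").
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)

# 0.9 accuracy -- looks good on paper, yet it never finds a single apple.
print(accuracy_score(y, baseline.predict(X)))
```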

Yes, doesn't this sound a bit lazy? In fact, the machine also figures out this little trick: after training, it becomes "smart" and predicts the majority class every time. But that is not acceptable! Next, let's talk about several ways to solve this problem.

1   Method 1: Find a way to get more data

First of all, consider whether we can get more data. Often we are still in an early stage of data collection, and the data we have only reflects the trend over the first part of the period; in the second half of the period, the trend may be different.

If the data for the second half of the period has not been collected, the forecast may not be accurate overall. So finding a way to obtain more data may improve the situation~

2   Method 2: Use a different way of judging the result

Under normal circumstances, we use accuracy and cost (error) to judge the result of machine learning. But in the face of imbalanced data, high accuracy and low error are not that useful or important.

So we can evaluate in another way. Very often we use a Confusion Matrix to compute Precision & Recall, and then use Precision & Recall to compute the F1 Score (or F-score). With these metrics we can, to a large extent, see how well each class of the imbalanced data is handled and give a more meaningful score. Limited by my own level, the concrete calculation and derivation will be worked through in a future post! (yes, I am committing to that)
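As a small illustration, here is a minimal sketch of this kind of evaluation with scikit-learn; the labels are made up, and I assume the minority "apple" class is encoded as 1:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Made-up labels: 1 = apple (minority class), 0 = pear (majority class).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print(precision_score(y_true, y_pred))   # TP / (TP + FP) for the apple class
print(recall_score(y_true, y_pred))      # TP / (TP + FN) for the apple class
print(f1_score(y_true, y_pred))          # harmonic mean of precision and recall
```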

3   Method 3: Reorganize data

The third method is relatively simple and crude: recombine the imbalanced data so that it becomes balanced.


The first way is to copy samples from the minority class so that it reaches the same number of samples as the majority class.


The second way is to trim the majority class, throwing away some of its samples so that the two classes end up with similar numbers.
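Here is a minimal sketch of both ways using scikit-learn's resample utility; the toy arrays and the 90/10 split are assumptions made purely for illustration:

```python
import numpy as np
from sklearn.utils import resample

X = np.arange(100).reshape(-1, 1)   # toy features
y = np.array([0] * 90 + [1] * 10)   # 0 = pear (majority), 1 = apple (minority)

X_major, y_major = X[y == 0], y[y == 0]
X_minor, y_minor = X[y == 1], y[y == 1]

# Way 1 (oversampling): copy minority samples, with replacement, up to the majority size.
X_minor_up, y_minor_up = resample(X_minor, y_minor, replace=True,
                                  n_samples=len(y_major), random_state=42)
X_over = np.vstack([X_major, X_minor_up])
y_over = np.concatenate([y_major, y_minor_up])

# Way 2 (undersampling): keep only a subset of the majority, down to the minority size.
X_major_down, y_major_down = resample(X_major, y_major, replace=False,
                                      n_samples=len(y_minor), random_state=42)
X_under = np.vstack([X_major_down, X_minor])
y_under = np.concatenate([y_major_down, y_minor])

print(np.bincount(y_over), np.bincount(y_under))  # both class counts are now equal
```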


4   Method 4: Use other machine learning methods


Some machine learning methods, such as neural networks, tend to be rather helpless in the face of imbalanced data, while methods such as decision trees are far less affected by it.
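As a rough sketch (the data set and model choice here are my own assumptions, not from the article), switching to a tree-based model in scikit-learn is a one-line change, and we can judge it by the F1 score from Method 2 rather than by raw accuracy:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score

# A made-up 90% / 10% data set, purely for illustration.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# Swap in a tree-based model and judge it by F1, not by raw accuracy.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f1_score(y_test, tree.predict(X_test)))
```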


5   Method 5: Modify the algorithm

Among all the methods, the most creative one is to modify the algorithm itself. If you use a Sigmoid function, it comes with a prediction threshold: if the output is below the threshold, the prediction is "pear"; if it exceeds the threshold, the prediction is "apple".


However, because there are so many pears now, we can adjust the position of the threshold a bit, shifting it towards the apple side: only when the model is very confident will it predict "apple". In this way, the machine can learn a better result.
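Here is a minimal sketch of shifting the decision threshold with a logistic-regression (sigmoid) model; the 0.7 threshold, the data set, and the encoding 1 = apple are assumptions for illustration, not values given in the article:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Made-up imbalanced data: class 0 = pear (majority), class 1 = apple (minority).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Default behaviour: predict "apple" whenever the sigmoid output exceeds 0.5.
default_pred = model.predict(X)

# Shifted threshold: predict "apple" only when the model is quite confident (p > 0.7).
proba_apple = model.predict_proba(X)[:, 1]
shifted_pred = (proba_apple > 0.7).astype(int)

print(np.bincount(default_pred), np.bincount(shifted_pred))
```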


OK, that is a simple summary of this article. If you want to learn more about machine learning, you are welcome to follow my Jianshu channel~ see the original link~


Please contact [email protected] for submission

A preview of upcoming articles!

"Plainly explain why the initialization of neural network parameters cannot be all 0"

"Popular Explanation of Hidden Markov Model (HMM)-Backward Algorithm"

"Popular Explanation of Hidden Markov Model (HMM)-Viterbi Algorithm"

I have been quite busy with graduation recently~ Posts may come a bit late, but whenever a post does come, it will be full of useful content~


Recommended reading!

Hidden Markov Model: the Basic Model and the Three Fundamental Problems

In-depth Understanding of the Decision Tree Algorithm (1): Core Ideas

A Naive Bayes Classification Example: the Word Correction Problem


All of them are practical and easy to understand! Just pin the channel to the top~ Welcome to follow and discuss~



