3. Feature Cleaning

1. Cleaning

Remove dirty data, for example order-brushing (fake transaction) records for certain items and records with many missing values; abnormal data are generally discarded directly.

Filter according to the business context

For example, remove data generated by crawlers, spam, cheating, and so on.

Use outlier detection algorithms

Deviation detection: clustering, nearest neighbors, etc.

Statistics-based outlier detection: e.g., range, interquartile range, mean deviation, standard deviation.

Distance-based outlier detection: a point whose distance to most other points exceeds a certain threshold is considered an outlier; the main distance metrics used are absolute distance (Manhattan distance), Euclidean distance, Mahalanobis distance, and so on.

Density-based outlier detection: examine the density around the current point; this can find local outliers, e.g., the LOF algorithm.
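The sketch below illustrates two of the approaches above on toy data: a statistics-based rule using the interquartile range, and density-based detection with scikit-learn's LOF implementation. The data, thresholds, and number of neighbors are only illustrative.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Toy 1-D feature containing a few extreme values (illustrative data only).
x = np.array([10.0, 11.2, 9.8, 10.5, 10.1, 55.0, 9.9, 10.3, -30.0, 10.7])

# Statistics-based rule: flag points more than 1.5 * IQR beyond the quartiles.
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
print("IQR outliers:", np.where(iqr_outliers)[0])

# Density-based rule: Local Outlier Factor on a small 2-D feature matrix.
X = np.column_stack([x, np.random.default_rng(0).normal(size=x.shape)])
lof = LocalOutlierFactor(n_neighbors=3)   # small k only because the toy set is tiny
labels = lof.fit_predict(X)               # -1 marks local outliers
print("LOF outliers:", np.where(labels == -1)[0])
```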

2. Sampling

After collection and cleaning, the positive and negative samples are often imbalanced, so the data should be resampled.

Ranking the problems by difficulty from easy to hard: large data + balanced distribution < large data + imbalanced distribution < small data + balanced distribution < small data + imbalanced distribution. This shows that small datasets are a troublesome problem in machine learning.

There are two ways to deal with this:

(1) From the data perspective, the main method is sampling: since the samples are imbalanced, they can be resampled according to some strategy so that the data become relatively balanced;

(2) From the algorithm perspective, take into account the different costs of different misclassification cases and optimize the algorithm so that it still performs well on imbalanced data.

The data perspective

There are two kinds of random sampling: with replacement and without replacement.

Under-sampling (random under-sampling): randomly draw a smaller number of samples from the majority class and combine them with the minority-class samples to form a balanced dataset.

Over-sampling (random over-sampling): repeatedly draw samples with replacement from the minority-class sample set and combine them with the majority-class sample set to form a balanced dataset.
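Below is a minimal NumPy sketch of these two random sampling strategies. It assumes a feature matrix X and a binary label vector y; the function and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_undersample(X, y, majority_label):
    """Keep all minority samples; draw an equal number of majority samples without replacement."""
    maj_idx = np.where(y == majority_label)[0]
    min_idx = np.where(y != majority_label)[0]
    keep_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
    idx = np.concatenate([min_idx, keep_maj])
    return X[idx], y[idx]

def random_oversample(X, y, minority_label):
    """Keep all majority samples; draw minority samples with replacement until the classes match."""
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    extra = rng.choice(min_idx, size=len(maj_idx), replace=True)
    idx = np.concatenate([maj_idx, extra])
    return X[idx], y[idx]

# Toy usage: 90 majority samples vs 10 minority samples.
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)
X_u, y_u = random_undersample(X, y, majority_label=0)
X_o, y_o = random_oversample(X, y, minority_label=1)
print(np.bincount(y_u), np.bincount(y_o))   # both roughly balanced
```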

Problems with random sampling:

  1. Under-sampling causes information loss: samples that are not drawn may contain important information, so the model can only learn part of the picture;
  2. When the minority class has few samples, the new dataset formed by over-sampling contains many repeated samples, which easily leads to over-fitting;
  3. Over-sampling copies the minority samples several times, so the same point appears repeatedly in the feature space. This causes a problem: with good luck many of these points are classified correctly, and with bad luck the whole group is classified wrongly. One way to mitigate this is to add a slight random perturbation whenever a new data point is generated.
  4. Combine sampling with cross-validation!

The algorithm perspective

Data synthesis

Data synthesis methods use existing samples to generate more samples. These methods have many success stories in small-data scenarios, such as medical image analysis.

One popular method is SMOTE, which uses similarity in the feature space among minority samples to generate new samples: for a minority sample x, randomly select a sample point x' from its k nearest neighbors and generate a new sample point x_new = x + (x' - x) × δ, where δ ∈ [0, 1] is a random number.
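A small sketch of this interpolation step, using scikit-learn's NearestNeighbors to pick x' among the k nearest minority neighbors of each chosen x. This is a simplified SMOTE-like routine, not a full reference implementation; the parameters and toy data are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_synthesis(X_minority, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between minority samples and their neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)  # +1: each point is its own nearest neighbor
    _, neighbors = nn.kneighbors(X_minority)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_minority))            # pick a minority sample x
        j = rng.choice(neighbors[i][1:])             # pick one of its k neighbors x'
        delta = rng.random()                         # delta drawn uniformly from [0, 1)
        synthetic.append(X_minority[i] + (X_minority[j] - X_minority[i]) * delta)
    return np.array(synthetic)

# Toy usage: synthesize 20 new points from 30 minority samples in 2-D.
X_min = np.random.default_rng(1).normal(size=(30, 2))
X_new = smote_like_synthesis(X_min, n_new=20)
print(X_new.shape)   # (20, 2)
```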

SMOTE synthesizes the same number of new samples for every minority sample, which brings some potential problems: (1) it increases the chance of overlap between classes; (2) it generates samples that provide no useful information. To address this, two methods appeared: Borderline-SMOTE and ADASYN.

The idea behind Borderline-SMOTE is to find the minority samples for which new samples should be synthesized. Such minority samples are characterized by being surrounded mostly by majority samples (more than half of their k nearest neighbors belong to the majority class), since these tend to be borderline samples.
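If the imbalanced-learn (imblearn) package is available, these variants can be applied directly; the toy dataset and default hyper-parameters below are for illustration only.

```python
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

# Toy imbalanced dataset: roughly 95% class 0, 5% class 1.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)

# Each sampler rebalances (X, y) by synthesizing new minority samples.
X_smote, y_smote = SMOTE(random_state=0).fit_resample(X, y)
X_bline, y_bline = BorderlineSMOTE(random_state=0).fit_resample(X, y)
X_ada, y_ada = ADASYN(random_state=0).fit_resample(X, y)
```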

Weighting

Assign different misclassification costs to different classes.

The difficulty is how to set reasonable weights. In practice, the weights are usually chosen so that the weighted cost is roughly equal across classes. This is not a universal rule, though; each problem still needs its own analysis.
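As one concrete example (a sketch, not a universal recipe), scikit-learn estimators expose a class_weight parameter: "balanced" makes each class weight inversely proportional to its frequency, which roughly equalizes the weighted loss across classes, and an explicit dictionary lets you set the costs by hand.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy imbalanced dataset: roughly 95% class 0, 5% class 1.
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)

# "balanced": weight_c = n_samples / (n_classes * n_samples_c), i.e. inverse class frequency.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

# Or set the relative misclassification costs by hand (the values here are illustrative).
clf_manual = LogisticRegression(class_weight={0: 1.0, 1: 19.0}, max_iter=1000).fit(X, y)
```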

One-class classification

For scenarios where positive and negative samples are extremely imbalanced, we can look at the problem from a completely different angle: treat it as a one-class learning (One Class Learning) or novelty detection (Novelty Detection) problem. The focus of such methods is not to capture the differences between classes but to model a single class; classic work includes One-class SVM and so on.
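A minimal sketch of the one-class view using scikit-learn's OneClassSVM: the model is fit only on samples of the "normal" (majority) class, and prediction then labels points as inliers (+1) or novelties (-1). The data and the nu/gamma settings are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Fit only on "normal" (majority-class) samples; the rare class is never shown to the model.
X_normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
oc_svm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(X_normal)

# At prediction time, +1 means the point looks like the modelled class, -1 means novelty/anomaly.
X_test = np.vstack([rng.normal(size=(5, 2)), rng.normal(loc=6.0, size=(5, 2))])
print(oc_svm.predict(X_test))
```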

Advice from experience

1. When both positive and negative samples are very scarce, data synthesis should be used;

2. When negative samples are plentiful, positive samples are very scarce, and the ratio between them is extreme, one-class methods should be considered;

3. When both positive and negative samples are sufficient and the ratio is not particularly extreme, sampling or weighting methods should be considered.

4. Sampling and weighting are mathematically equivalent, but in practice their results differ. This is especially true for methods such as Random Forest, whose training process randomly samples the training set. In that case, if computing resources permit, over-sampling is often somewhat better than weighting.

5. In addition, although both over-sampling and under-sampling can make the dataset balanced, and they are equivalent when there is enough data, the two are still different. In practice, my experience is to use over-sampling when computing resources are sufficient and the minority class has enough samples, and to use under-sampling otherwise, because over-sampling enlarges the training set and therefore the training time, while a small training set is very prone to over-fitting.

6. For under-sampling, if computing resources are relatively ample and a good parallel environment is available, ensemble methods should be chosen.
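The sketch below shows the ensemble idea for under-sampling (in the spirit of EasyEnsemble): train several base models, each on all minority samples plus a different random subset of the majority class, then average their predicted probabilities. The base learner, its depth, and the number of rounds are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ensemble_undersample_fit(X, y, minority_label, n_models=10, seed=0):
    """Train one model per balanced subset: all minority samples plus a fresh random majority subset."""
    rng = np.random.default_rng(seed)
    min_idx = np.where(y == minority_label)[0]
    maj_idx = np.where(y != minority_label)[0]
    models = []
    for _ in range(n_models):
        sub_maj = rng.choice(maj_idx, size=len(min_idx), replace=False)
        idx = np.concatenate([min_idx, sub_maj])
        models.append(DecisionTreeClassifier(max_depth=5).fit(X[idx], y[idx]))
    return models

def ensemble_predict_proba(models, X):
    """Average the probability estimates of the individual models."""
    return np.mean([m.predict_proba(X) for m in models], axis=0)
```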


Source: www.cnblogs.com/pacino12134/p/11368703.html