How to deal with missing data in machine learning?

  1. If a feature is missing for a very high proportion of the samples, it is generally dropped outright. Keeping it as a feature would mostly add noise and could hurt the result.
  2. If a moderate proportion of the values is missing and the feature is non-continuous (for example, a categorical attribute), NaN can be added to the feature as a new category.
  3. If a moderate proportion of the values is missing and the feature is continuous, consider discretizing (binning) it first, and then adding NaN as one more category of the resulting categorical feature.
  4. If only a few values are missing, they can be filled in (imputed) with a fixed value, the mean, the previous or next value (forward/backward fill), an interpolated value, or a value predicted by a fitted model.
  5. When train has missing values and test has none, the missing values can be filled with the conditional mean or conditional median. (The conditional mean is the mean of the attribute over all samples, e.g. users, that share the same label value as the sample being filled.)
  6. When both train and test have a large number of missing values, missingness itself can be treated as a feature, encoded as 0 or 1 according to whether the value is missing or not.
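
The strategies above can be sketched with pandas roughly as follows. This is a minimal illustration, not a recommended pipeline: the toy data, the column names (`city`, `age`, `income`, `label`), the 0.8 drop threshold, and the bin edges are all assumptions made up for the example. Model-based filling from item 4 is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Toy training frame; every name and value here is hypothetical.
train = pd.DataFrame({
    "city":   ["A", None, "B", "A", None, "B"],
    "age":    [23.0, np.nan, 41.0, 35.0, np.nan, 58.0],
    "income": [10.0, 12.0, np.nan, 16.0, 18.0, 20.0],
    "label":  [0, 0, 0, 1, 1, 1],
})

# Item 1: drop features whose missing ratio exceeds a threshold (0.8 is an assumed cutoff).
missing_ratio = train.isna().mean()
high_missing = missing_ratio[missing_ratio > 0.8].index
train_reduced = train.drop(columns=high_missing)

# Item 2: categorical feature, missing becomes its own category.
city_cat = train["city"].fillna("NAN")

# Item 3: continuous feature, discretize first, then add a missing-value category.
age_bin = pd.cut(train["age"], bins=[0, 30, 50, 120],
                 labels=["young", "middle", "senior"])
age_bin = age_bin.cat.add_categories("NAN").fillna("NAN")

# Item 4: simple fills for a feature with few missing values.
income_fixed = train["income"].fillna(0.0)                    # fixed value
income_mean = train["income"].fillna(train["income"].mean())  # mean
income_ffill = train["income"].ffill()                        # previous value
income_interp = train["income"].interpolate()                 # linear interpolation

# Item 5: conditional mean, i.e. the mean of the feature within the same label.
label_mean = train.groupby("label")["age"].transform("mean")
age_cond_mean = train["age"].fillna(label_mean)

# Item 6: missingness itself as a 0/1 feature.
age_missing = train["age"].isna().astype(int)
```

Note that the conditional fill in item 5 uses the label, so it can only be applied where labels are known, which is why the text limits it to the case where only train has missing values.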
