Practical Machine Learning Notes (2): Data Labeling

Data annotation

1. Data labeling process

  1. Is there enough data?
  2. Which part should be improved: the labels, the data, or the model?
  3. Are there enough labels?

    If not, consider semi-supervised learning.

  4. Is there enough budget?

    If so, label via crowdsourcing.

  5. Can we use weak labels?

    If so, consider weakly supervised learning.

2. Semi-supervised learning (SSL)


  1. SSL focuses on the scenario where there is a small amount of labeled data along with a large amount of unlabeled data.

  2. SSL makes assumptions about the data distribution in order to use the unlabeled data:

    • continuity assumption: examples with similar features are more likely to have the same label
    • cluster assumption: data has an inherent cluster structure; examples in the same cluster tend to have the same label
    • manifold assumption: the data lie on a manifold of much lower dimension than the input space
  3. self-training

    Self-training is an SSL method (a minimal sketch follows this list).

  4. We can use expensive models (deepen the network, or use model ensembles: if one model is not enough, use n of them)

    • deep neural networks, model ensemble/bagging
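
To make item 3 concrete, here is a minimal self-training sketch. The logistic-regression base model, the 0.9 confidence threshold, and the round count are illustrative assumptions, not details from the course:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_labeled, y_labeled, X_unlabeled, threshold=0.9, rounds=5):
    """Iteratively pseudo-label confident unlabeled examples and retrain."""
    model = LogisticRegression()   # assumption: any classifier with predict_proba works here
    X, y = X_labeled.copy(), y_labeled.copy()
    for _ in range(rounds):
        model.fit(X, y)
        if len(X_unlabeled) == 0:
            break
        proba = model.predict_proba(X_unlabeled)
        confident = proba.max(axis=1) >= threshold      # keep only high-confidence predictions
        if not confident.any():
            break
        pseudo = model.classes_[proba[confident].argmax(axis=1)]
        X = np.vstack([X, X_unlabeled[confident]])      # grow the labeled set with pseudo-labels
        y = np.concatenate([y, pseudo])
        X_unlabeled = X_unlabeled[~confident]
    return model
```

Because each retraining round runs offline, the base model can be swapped for a deep network or an ensemble, which is what item 4 above refers to.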

3. Challenges of Data Labeling

  1. Simplify user interaction: design easy tasks, clear instructions, and a simple-to-use interface
    • e.g., the user instructions and task design used by the MIT Places365 dataset
  2. Cost: reduce labeling cost with active learning + self-training (a sketch of the sampling strategies follows this list)
    • active learning focuses on the same scenario as SSL, but with human intervention
    • uncertainty sampling chooses the examples whose predictions are most uncertain and sends them to human labelers
    • similar to self-training, we can use expensive models
      • query-by-committee trains multiple models and performs majority voting
  3. Quality control: label quality varies across labelers (a voting sketch also follows this list)
    • simplest but most expensive: send the same task to multiple labelers, then determine the label by majority voting
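
A minimal sketch of the active-learning ideas in item 2: uncertainty sampling via least confidence, and query-by-committee via disagreement among models. The scoring functions and the budget parameter are illustrative assumptions:

```python
import numpy as np

def uncertainty_sampling(model, X_unlabeled, budget):
    """Select the examples whose top predicted probability is lowest (most uncertain)."""
    proba = model.predict_proba(X_unlabeled)
    uncertainty = 1.0 - proba.max(axis=1)
    return np.argsort(-uncertainty)[:budget]            # indices to send to human labelers

def query_by_committee(models, X_unlabeled, budget):
    """Select the examples on which a committee of models disagrees the most."""
    votes = np.stack([m.predict(X_unlabeled) for m in models])   # shape: (n_models, n_examples)
    def disagreement(column):
        _, counts = np.unique(column, return_counts=True)
        return 1.0 - counts.max() / len(column)          # fraction voting against the majority
    scores = np.array([disagreement(votes[:, j]) for j in range(votes.shape[1])])
    return np.argsort(-scores)[:budget]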
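```

And a small sketch of the majority-voting quality control in item 3; the labeler answers shown are made-up examples:

```python
from collections import Counter

def majority_vote(answers_per_task):
    """Resolve each task's final label by majority vote over several labelers."""
    return [Counter(answers).most_common(1)[0][0] for answers in answers_per_task]

# hypothetical example: three labelers annotate two images
print(majority_vote([["cat", "cat", "dog"], ["dog", "dog", "dog"]]))   # ['cat', 'dog']
```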


Origin: blog.csdn.net/jerry_liufeng/article/details/123350092