1.4 Data labeling

1 Data labeling process

  • Are there enough labels?
    • Yes: use them; even small models can learn well, or apply semi-supervised learning
    • No: is the annotation budget sufficient?
      • Yes: crowdsourcing / paid annotation
      • No: weak annotation (weak supervision)
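The decision flow above can be sketched as a small helper function (the function name and return strings are my own illustration, not from the original notes):

```python
def choose_labeling_strategy(has_enough_labels: bool, has_budget: bool) -> str:
    """Encode the labeling decision tree: first check label availability,
    then check whether the annotation budget allows buying more labels."""
    if has_enough_labels:
        # Enough labels: train directly, possibly with semi-supervised learning.
        return "semi-supervised learning"
    if has_budget:
        # Budget available: pay humans to label the data.
        return "crowdsourcing / paid annotation"
    # No labels and no budget: fall back to weak supervision.
    return "weak annotation"

print(choose_labeling_strategy(False, True))
```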


2 Semi-supervised learning (SSL)

Semi-supervised learning applies when only a small part of the dataset is labeled and the large majority is unlabeled.

For example, in Taobao's recommendation scenario, only a few users browse a product and then purchase it; most users just browse and take no further action. Purchases provide feedback that can serve as labels, so only a small part of the data is labeled and the rest is not.

Training a model on labeled and unlabeled data together is semi-supervised learning. Using it relies on some assumptions about the data:

  • Continuity (smoothness) assumption: if two samples have similar features, they are assumed to have the same label
  • Cluster assumption: the data can be divided into clusters, and samples within the same cluster have the same label
  • Manifold assumption: the intrinsic dimension of the data is lower than its apparent dimension, so the data can be handled by dimensionality reduction

3 Self-training

Process: labeled data → training → model → pseudo-labels → data merging

The trained model predicts pseudo-labels for the unlabeled data; the pseudo-labeled data is merged with the labeled data, and the model is retrained. This cycle can repeat.

Notes: since this model is only used to generate pseudo-labels and will not be deployed online, you can choose a more expensive model; keep only pseudo-labels predicted with high confidence.
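The self-training loop above can be sketched as follows (the dataset, classifier choice, and the 0.95 confidence threshold are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative setup: a small labeled set and a large unlabeled pool.
X, y = make_classification(n_samples=500, random_state=0)
X_lab, y_lab = X[:50], y[:50]
X_unlab = X[50:]

model = LogisticRegression()
for _ in range(3):                         # a few self-training rounds
    model.fit(X_lab, y_lab)                # train on labeled (+ pseudo-labeled) data
    proba = model.predict_proba(X_unlab)
    conf = proba.max(axis=1)
    keep = conf > 0.95                     # keep only high-confidence pseudo-labels
    if not keep.any():
        break
    pseudo = proba.argmax(axis=1)[keep]
    X_lab = np.vstack([X_lab, X_unlab[keep]])      # merge pseudo-labeled data
    y_lab = np.concatenate([y_lab, pseudo])
    X_unlab = X_unlab[~keep]               # remove them from the unlabeled pool

print(f"training set grew from 50 to {len(y_lab)} samples")
```

Thresholding on confidence is the key safeguard: low-confidence pseudo-labels are more likely wrong, and merging them in would reinforce the model's mistakes.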

Origin blog.csdn.net/ch_ccc/article/details/129877358