1.4 Data labeling
1. Data labeling process (a decision flow)
- Are there enough labels?
    - Yes: train the model (even small models can learn, and semi-supervised learning can stretch the labels further)
    - No: is the labeling budget sufficient?
        - Yes: crowdsourcing / paid annotation
        - No: weak supervision (weakly labeled data)
2. Semi-supervised learning (SSL)
SSL mainly applies when only a small part of the dataset is labeled and the large remainder is unlabeled.
For example, in the Taobao recommendation scenario, only a few users browse a product and then purchase it; most users just browse and do nothing else. This means only a small portion of the data carries feedback (labels), while the rest is unlabeled.
Training a model on labeled and unlabeled data together is semi-supervised learning, but using it rests on some assumptions:
- Continuity (smoothness) assumption: if two examples have similar features, they are assumed to share the same label
- Cluster assumption: the data can be partitioned into clusters, and examples within the same cluster share the same label
- Manifold assumption: the intrinsic dimension of the data may be lower than the apparent dimension, so dimensionality reduction can simplify the problem
3. Self-training
Process: labeled data –> train –> model –> pseudo-labeled data –> merge back into the labeled data (repeat)
Unlabeled data –> model (the model predicts pseudo-labels for the unlabeled data)
Note: 1. since this model is never deployed online, you can afford a more expensive model; 2. keep only pseudo-labels with high confidence.
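The self-training loop above can be sketched as follows. The base model (`LogisticRegression`), the 0.95 confidence threshold, and the round count are illustrative assumptions; the notes only require a high confidence bar when merging pseudo-labels back.

```python
# Self-training sketch: train on labeled data, pseudo-label the
# unlabeled data, keep only confident predictions, merge, repeat.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, random_state=0)
labeled = np.zeros(len(y), dtype=bool)
labeled[:50] = True                       # pretend only 50 points are labeled

X_lab, y_lab = X[labeled], y[labeled]
X_unlab = X[~labeled]

for _ in range(3):                        # a few self-training rounds
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    if len(X_unlab) == 0:
        break
    proba = model.predict_proba(X_unlab)
    confident = proba.max(axis=1) > 0.95  # high-confidence pseudo-labels only
    if not confident.any():
        break
    X_lab = np.vstack([X_lab, X_unlab[confident]])
    y_lab = np.concatenate([y_lab, proba[confident].argmax(axis=1)])
    X_unlab = X_unlab[~confident]

print(f"final labeled pool: {len(y_lab)} samples")
```

Since this model only generates training data and never serves traffic, the base learner could just as well be a large ensemble; the threshold trades pseudo-label quantity against quality.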