A sample-construction trick for large-model fine-tuning

Interviewer: How does large model fine-tuning organize training samples?

You: A training sample pairs one question with one answer, or one instruction with one output. The question or instruction serves as the prompt input, and the answer is the target output. When computing the loss, the pad tokens should be masked out.
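Below is a minimal sketch of this construction (my own illustration, not code from the original post). It assumes the prompt and answer are already tokenized, builds input_ids and labels, and masks both the prompt and the pad positions with -100, the index that PyTorch's CrossEntropyLoss ignores by default.

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss by default

def build_sample(prompt_ids, answer_ids, pad_id, max_len):
    input_ids = (list(prompt_ids) + list(answer_ids))[:max_len]
    # Loss is computed only on answer tokens: prompt positions are masked out.
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(answer_ids))[:max_len]
    # Pad to max_len; pad positions are also excluded from the loss.
    pad_n = max_len - len(input_ids)
    input_ids = input_ids + [pad_id] * pad_n
    labels = labels + [IGNORE_INDEX] * pad_n
    attention_mask = [1] * (max_len - pad_n) + [0] * pad_n
    return input_ids, labels, attention_mask

# Example: 3 prompt tokens, 2 answer tokens, padded to length 8.
print(build_sample([11, 12, 13], [21, 22], pad_id=0, max_len=8))
# ([11, 12, 13, 21, 22, 0, 0, 0],
#  [-100, -100, -100, 21, 22, -100, -100, -100],
#  [1, 1, 1, 1, 1, 0, 0, 0])
```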

Interviewer: How do you organize training samples for multi-turn dialogue?

You: Assuming the turns are Q1A1/Q2A2/Q3A3, the session can be converted into three training samples: Q1->A1, Q1A1Q2->A2, Q1A1Q2A2Q3->A3.
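A rough sketch of this expansion (illustrative only; the plain string concatenation and the expand_session helper are my own assumptions, and real data would go through a chat template):

```python
def expand_session(turns):
    """turns: list of (question, answer) pairs -> list of (prompt, target) samples."""
    samples, history = [], ""
    for q, a in turns:
        samples.append((history + q, a))  # prompt = full history + current question
        history += q + a                  # the answered turn joins the history
    return samples

session = [("Q1", "A1"), ("Q2", "A2"), ("Q3", "A3")]
for prompt, target in expand_session(session):
    print(prompt, "->", target)
# Q1 -> A1
# Q1A1Q2 -> A2
# Q1A1Q2A2Q3 -> A3
```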

Interviewer: So one session becomes three samples, with the earlier turns repeated across them. Does that cause any problems?

You: After padding, most of the tokens in each sample are pad tokens, so the training data is used inefficiently. There is also a data-blowup problem: the number of training samples becomes the number of sessions times the average number of turns (e.g., 10,000 sessions averaging 5 turns yield 50,000 samples), and since the earlier turns are duplicated across those samples, training efficiency is low as well.

Interviewer: You've spotted the problem. Is there any way to improve it?

You: Is there a way to construct the entire session as a single training sample? (thinking)

Interviewer: Here is a hint: limit yourself to decoder-only models, and use a property of the model to improve how the samples are organized.



For this problem, let's think about the characteristics of decoder-only models. The first is that their attention is causal: a simple way to picture a causal mask is as a lower-triangular matrix, where each token can only attend to the tokens before it.

[Figure: the causal attention mask, a lower-triangular matrix in which each token attends only to earlier positions]
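This causal property is exactly what the hint points to: if the whole session is concatenated into a single sequence, each answer token already attends only to its own history, so the loss can be computed on all answer spans in one forward pass. Below is a minimal sketch of this packing (my own illustration, not code from the original post; it assumes pre-tokenized turns and the usual -100 ignore index):

```python
IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss, as before

def pack_session(turns):
    """turns: list of (q_ids, a_ids) token-id lists -> one (input_ids, labels) sample."""
    input_ids, labels = [], []
    for q_ids, a_ids in turns:
        input_ids += list(q_ids) + list(a_ids)
        # Question positions are masked; every answer span contributes to the loss.
        labels += [IGNORE_INDEX] * len(q_ids) + list(a_ids)
    return input_ids, labels

session = [([11], [21, 22]), ([12], [23]), ([13], [24, 25])]
ids, labels = pack_session(session)
print(ids)     # [11, 21, 22, 12, 23, 13, 24, 25]
print(labels)  # [-100, 21, 22, -100, 23, -100, 24, 25]
```

One session now yields exactly one sample: nothing is duplicated across samples, and there is far less padding. Note that causal-LM implementations such as those in Hugging Face Transformers shift the labels internally when computing the loss, which is why labels align one-to-one with input_ids here.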


Origin: blog.csdn.net/u013250861/article/details/131686901