BERT NER training: entity offset problem caused by the tokenizer

Recording a pitfall I stepped into here.

In the training samples there were almost no cases where a long run of English letters or digits appeared in front of an entity, so training was quite stable.
At prediction time, however, such inputs did appear: the entity was still recognized, but its predicted offset was wrong (shifted earlier than the true position).

The cause lies in how the text is passed to BERT to extract features: you can either pass the raw string, or pass a list of tokens that you have already produced yourself.

If you pass the raw string, the default tokenizer does not treat English letters and digits as single-character tokens; it merges them into subword units (WordPiece). The token sequence therefore becomes shorter than the character sequence, and the character-level entity labels end up shifted forward.
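A minimal illustration of the mismatch, assuming the Hugging Face BertTokenizer and an invented example sentence (neither is from the original post):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # model name is illustrative

text = "ABC123患者张三入院"          # long English/digit run in front of the entity "张三"
char_labels = list("OOOOOOOOBIOO")  # one label per character; B/I mark the entity

# Passing the raw string: WordPiece merges the English/digit run into a few subwords,
# so the token sequence is shorter than the character-level label sequence.
tokens = tokenizer.tokenize(text)
print(tokens)                   # e.g. ['abc', '##12', '##3', '患', '者', '张', '三', '入', '院']
print(len(text), len(tokens))   # 12 characters vs. fewer tokens: the labels shift forward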

For an NER task it is better to tokenize the data yourself (character by character) and train on that, so the labels stay aligned with the tokens.
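A minimal sketch of doing the tokenization yourself, character by character, with out-of-vocabulary characters mapped to [UNK] (the helper name and example text are my own):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # model name is illustrative

def char_tokenize(text, vocab):
    # Split every character yourself so the token count always equals the label count.
    return [ch if ch in vocab else "[UNK]" for ch in text]

text = "ABC123患者张三入院"
char_labels = list("OOOOOOOOBIOO")          # one label per character
tokens = char_tokenize(text, tokenizer.vocab)
assert len(tokens) == len(char_labels)      # entity offsets now stay aligned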

Interested readers can look at the relevant source code:
def convert_lst_to_features(lst_str, seq_length, tokenizer, is_tokenized=False):
    """Loads a data file into a list of `InputBatch`s."""

    # If the caller already tokenized the input, read it as token lists;
    # otherwise read raw strings.
    examples = read_tokenized_examples(lst_str) if is_tokenized else read_examples(lst_str)

    # Pre-tokenized input only has unknown tokens mapped to [UNK]; raw strings go
    # through the full (WordPiece) tokenizer, which can merge English letters and
    # digits and change the sequence length.
    _tokenize = lambda x: tokenizer.mark_unk_tokens(x) if is_tokenized else tokenizer.tokenize(x)
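If this snippet is from bert-as-service, the corresponding client-side call would presumably pass pre-tokenized input with is_tokenized=True so the server skips its own tokenizer; the address is a placeholder, and token-level vectors also require starting the server with -pooling_strategy NONE:

from bert_serving.client import BertClient

bc = BertClient(ip="localhost")                 # server address is a placeholder
tokens = list("ABC123患者张三入院")              # one token per character
vecs = bc.encode([tokens], is_tokenized=True)   # one vector per position, offsets preserved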

Origin blog.csdn.net/cyinfi/article/details/90349894