[Paper Interpretation Series] NER Direction: MarkBERT (2022)



Introduction

Paper address:
https://arxiv.org/abs/2203.06378

Paper code:
https://github.com/daiyongya/markbert

MarkBERT is also a solution for introducing word information into the model. MarkBERT is a character-based model, but it cleverly integrates word boundary information by inserting boundary markers between words. The presence of a boundary marker means that the preceding character is the last character of one word and the following character is the first character of the next word. In this way, all words can be handled uniformly and there is no word-level OOV problem. In addition, MarkBERT has two further advantages:

  • It is convenient to add a word-level learning objective on the boundary markers (the paper uses a replaced word detection task), which supplements the traditional character-level (e.g. MLM) and sentence-level (e.g. NSP) pre-training tasks.
  • It is convenient to introduce richer semantic information. For example, to bring in the POS tag of a word, the marker can simply be replaced with a POS-specific marker.

MarkBERT achieved SOTA results on the Chinese NER task, improving 95.4% -> 96.5% on the MSRA dataset and 82.8% -> 84.2% on the OntoNotes dataset. MarkBERT also achieved better accuracy on text classification, keyword recognition, and semantic similarity tasks.

There are two tasks in the MarkBERT pre-training phase:

  • MLM: the boundary markers are also masked, so that the model learns word boundary knowledge.
  • Replaced word detection: a word is deliberately replaced with a confusion word, and the model must judge from the marker whether the word preceding it is the original one.

Model structure

MarkBERT Model:
First perform word segmentation, then insert a special marker between words (the paper uses [S] as this marker). The markers are treated as ordinary characters: they have their own position embeddings and can also be masked. The model therefore has to pay attention to word boundaries when encoding, rather than simply filling in masked positions from the local context, which makes the MASK prediction task more challenging (prediction requires a better understanding of word boundaries). In this way, the character-based MarkBERT model incorporates word-level information through explicit word boundary information.
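As a concrete illustration, below is a minimal sketch (not the authors' code) of how boundary markers could be inserted into a character sequence given an off-the-shelf word segmentation; the [S] marker string follows the paper, everything else is illustrative.

```python
def insert_markers(words, marker="[S]"):
    """Convert a list of segmented words into a character sequence with a
    boundary marker inserted between consecutive words."""
    tokens = []
    for i, word in enumerate(words):
        tokens.extend(list(word))      # character-level tokens
        if i < len(words) - 1:
            tokens.append(marker)      # marker: previous char ends a word, next char starts one
    return tokens

# Example with a pre-segmented sentence (segmentation can come from any standard tool)
print(insert_markers(["北京", "欢迎", "你"]))
# ['北', '京', '[S]', '欢', '迎', '[S]', '你']
```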

[Figure 1: MarkBERT model structure]

Replacement word detection:
Specifically, when a word is replaced with a confusion word, the marker following it should predict "replaced" (label False); otherwise it predicts True. Let the representation of the $i$-th marker be denoted $x^i$, and let $y^{true}$ and $y^{false}$ denote the labels for a correct and a replaced word respectively. The replaced word detection loss is defined as follows:

$$\mathcal{L}=-\sum_{i}\left[y^{\text{true}} \cdot \log \left(x_{y}^{i}\right)+y^{\text{false}} \cdot \log \left(x_{y}^{i}\right)\right]$$

This loss is added to the MLM loss as the final training loss. The confusion words come from synonyms or phonetically similar words. Through the replaced word detection task, the markers become more sensitive to word spans in the context. To further incorporate semantic information, POS tags can be used as the boundary markers; as shown in Figure 1, the model that uses POS tags as boundary markers is called MarkBERT-POS.
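For concreteness, here is a minimal PyTorch sketch of how the replaced word detection loss could be computed from the marker representations and combined with the MLM loss; the tensor shapes and names (marker_hidden, rwd_head) are illustrative assumptions, not taken from the released code.

```python
import torch
import torch.nn.functional as F

hidden_size, num_markers = 768, 4
marker_hidden = torch.randn(num_markers, hidden_size)  # marker representations x^i from the encoder
labels = torch.tensor([1, 1, 0, 1])                    # 1 = True (word kept), 0 = False (word replaced)

rwd_head = torch.nn.Linear(hidden_size, 2)             # binary classifier applied to each marker
logits = rwd_head(marker_hidden)
rwd_loss = F.cross_entropy(logits, labels)             # cross-entropy over the True/False labels

mlm_loss = torch.tensor(0.0)                           # placeholder for the usual MLM loss term
total_loss = mlm_loss + rwd_loss                       # the two losses are summed for training
```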

Pre-training proportions:
The MASK ratio is still 15%. For 30% of the time no markers are inserted (the original BERT setting); for 50% of the time whole-word masking (WWM) prediction is performed; for the remaining time standard MLM prediction is performed.

When markers are inserted, 30% of the time a word is replaced with a pronunciation-based or synonym-based confusion word, and the corresponding marker must predict the pronunciation-confusion or synonym-confusion label (i.e. False); the rest of the time the marker predicts the normal-word label (i.e. True). To avoid label imbalance, the loss on normal markers is computed only 15% of the time.
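The sampling schedule described above might look roughly like the following sketch; the mode names and the helper structure are hypothetical, only the proportions come from the paper.

```python
import random

def sample_pretraining_mode(rng=random):
    """Pick how to build one pre-training example, following the stated proportions."""
    r = rng.random()
    if r < 0.30:
        mode = "no_marker_mlm"      # 30%: no markers inserted, original BERT-style MLM
    elif r < 0.80:
        mode = "marker_wwm"         # 50%: markers inserted, whole-word masking
    else:
        mode = "marker_mlm"         # remaining 20%: markers inserted, character-level MLM

    replaced = False
    keep_normal_marker_loss = False
    if mode != "no_marker_mlm":
        replaced = rng.random() < 0.30            # 30%: swap in a phonetic/synonym confusion word (marker label False)
        if not replaced:
            keep_normal_marker_loss = rng.random() < 0.15  # keep only 15% of the loss on normal (True) markers
    return mode, replaced, keep_normal_marker_loss

print(sample_pretraining_mode())
```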

Experimental results

The effect on the NER task is shown in the table below:

[Table: NER results on MSRA and OntoNotes]

It can be seen that the improvement is quite clear.

Ablation experiments were done with three settings:

  • MarkBERT-MLM: MLM task only
  • MarkBERT-rwd: remove the phonetically-confusable words or the synonyms, respectively, from the replaced word detection task
  • MarkBERT-w/o: remove the markers when fine-tuning on downstream tasks (i.e. used the same way as the original BERT)

The results of the ablation experiment are shown in the table below:

[Table: ablation results]

From the ablation results, it can be seen that:

  • MarkBERT-MLM (without the replaced word detection task) already brings a significant improvement on NER, indicating that word boundary information matters for fine-grained tasks.
  • Without inserting markers, MarkBERT-w/o still matches the baseline, indicating that MarkBERT can simply be used like BERT on language understanding tasks when markers are not needed.
  • For NER, inserting markers remains important; the results show that MarkBERT effectively learns word boundaries for tasks that require such fine-grained representations.

Discussion

Existing Chinese BERT models integrate word information with two strategies:

  • Use word information in the pre-training stage, but use character sequences on downstream tasks, such as Chinese-BERT-WWM, Lattice-BERT.
  • Use word information when using pre-trained models in downstream tasks, such as WoBERT, AmBERT, Lichee.

In addition, the idea of inserting markers has been explored in entity-related NLU tasks, especially relation classification: given a subject entity and an object entity, existing work injects untyped markers or entity-specific markers to better predict the relation between the entities.


In addition, marker information is also required at prediction time, and since it comes from automatic word segmentation it may contain errors; the authors have not done further ablation studies on this point.


Original post: https://blog.csdn.net/ljp1919/article/details/127071824