(Miscellaneous talk) Some thoughts about UIE

0. Getting to know UIE first

Some time ago, the UIE model released by Baidu was making the rounds on various NLP-related public accounts, almost always introduced with the word "unified". Out of curiosity, I installed paddle and tried the model myself. Following the official instructions, it produced good results as expected. This is the first time since the concept of the prompt was proposed a year ago that I have encountered a model that surprised me from an application point of view. But after the initial surprise, on closer reflection, this "unified" model does not seem particularly novel.
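For readers who haven't tried it, here is a minimal sketch of calling it through taskflow, based on the official PaddleNLP README at the time (the schema and example sentence follow the official demo):

from pprint import pprint
from paddlenlp import Taskflow

# Define the extraction schema: entity-style argument types for a sports report.
schema = ["时间", "选手", "赛事名称"]  # time, contestant, competition name
ie = Taskflow("information_extraction", schema=schema)
pprint(ie("2月8日上午北京冬奥会自由式滑雪女子大跳台决赛中中国选手谷爱凌以188.25分获得金牌！"))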

1. True or false UIE?

Perhaps many people, like me, first encountered this model through PaddleNLP's GitHub repo, downloaded it to try it out, and only then dug into its principle. But after reading the UIE paper, when I went back to the code and saw the model structure, I was very surprised:

import paddle
import paddle.nn as nn
from paddlenlp.transformers import ErniePretrainedModel


class UIE(ErniePretrainedModel):

    def __init__(self, encoding_model):
        super(UIE, self).__init__()
        # The taskflow "UIE" is just an ERNIE encoder...
        self.encoder = encoding_model
        hidden_size = self.encoder.config["hidden_size"]
        # ...plus two linear heads scoring each token as a span start / end.
        self.linear_start = paddle.nn.Linear(hidden_size, 1)
        self.linear_end = paddle.nn.Linear(hidden_size, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, input_ids, token_type_ids, pos_ids, att_mask):
        sequence_output, pooled_output = self.encoder(
            input_ids=input_ids,
            token_type_ids=token_type_ids,
            position_ids=pos_ids,
            attention_mask=att_mask)
        # Per-token probability of being the start of an answer span.
        start_logits = self.linear_start(sequence_output)
        start_logits = paddle.squeeze(start_logits, -1)
        start_prob = self.sigmoid(start_logits)
        # Per-token probability of being the end of an answer span.
        end_logits = self.linear_end(sequence_output)
        end_logits = paddle.squeeze(end_logits, -1)
        end_prob = self.sigmoid(end_logits)
        return start_prob, end_prob

Where is the promised T5? Isn't this supposed to be a generative model? Why is it ERNIE?

From the model structure it is clear that this is an extractive model, not the generative model described in the paper: an ERNIE encoder followed by double-pointer (start/end span) decoding. Isn't this structure exactly the MRC model proposed by ShannonAI two years ago? (The MRC paper was the first paper I read after entering NLP, so it left a deep impression on me.)

Seeing this was unexpected, yet also expected. Unexpected, because it is completely different from the model in the UIE paper, almost completely "wrong"; expected, because the model structure directly answers several questions I had run into when testing it:
(1) Why can the model extract overlapping entities? Because decoding is a double-pointer task, not sequence labeling.
(2) The model claims to extract both events and relations, but why do the prompts for events and relations look so similar in practice? Let's set this aside for now and come back to it later.
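To make (1) concrete, here is a minimal sketch of how per-token start/end probabilities might be decoded into spans. The 0.5 threshold and the nearest-end pairing rule are my own assumptions, not necessarily the exact taskflow logic; the point is that every token is scored independently as a start or an end, so extracted spans are free to overlap:

def decode_spans(start_prob, end_prob, threshold=0.5):
    # start_prob / end_prob: per-token probabilities from the two linear heads.
    starts = [i for i, p in enumerate(start_prob) if p > threshold]
    ends = [i for i, p in enumerate(end_prob) if p > threshold]
    spans = []
    for s in starts:
        # Pair each start with the nearest end at or after it (an assumption).
        candidates = [e for e in ends if e >= s]
        if candidates:
            spans.append((s, min(candidates)))
    return spans

print(decode_spans([0.9, 0.1, 0.8, 0.1], [0.1, 0.7, 0.1, 0.9]))  # [(0, 1), (2, 3)]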

So where is the real UIE? You have to look at the link given in the paper:
https://github.com/universal-ie/UIE
This is a PyTorch version, and it matches the idea described in the paper.

In addition, the paddle ecosystem also has this generative model, under the name DuUIE, which can likewise be found on GitHub.

So it is not that the paper and the code are inconsistent; rather, the uie we see in paddle's taskflow is different from the UIE in the paper, and can be regarded as a quick, ready-to-call application. After all, the fastest way to promote a piece of work is to open it up to everyone as a fast and effective application.

2. HAE and UIE

Next, let's discuss the paddle version of UIE, i.e. the extractive model.
Back in 2020 I came across a piece of work on "history embedding", i.e. encoding extraction history; as I recall it was also from a Baidu team. Later I did a little work of my own along this line of thought.

The so-called history encoding adds an extra embedding layer on the input side of the encoder: besides token-emb and seg-emb, a history-emb is added (pos-emb is not mentioned here because in that model positional information is handled inside attention). With such an embedding layer, the model can perceive which positions have already appeared as historical arguments.
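A minimal sketch of this idea (my own naming and code, not the original implementation), in paddle for consistency with the model code above:

import paddle.nn as nn

class HistoryEmbedding(nn.Layer):
    # Marks which positions have already appeared as historical arguments.

    def __init__(self, hidden_size):
        super().__init__()
        # Two states per token: 0 = never extracted, 1 = was a historical argument.
        self.history_emb = nn.Embedding(2, hidden_size)

    def forward(self, token_emb, history_ids):
        # token_emb: [batch, seq_len, hidden]; history_ids: [batch, seq_len] of 0/1.
        return token_emb + self.history_emb(history_ids)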

The model's input text is also given in QA form: [CLS] trigger word xxx, where did this xx event happen? [SEP] original text [SEP].
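For illustration, building such an input with a text-pair encoding might look like the following (the question template is my paraphrase of the scheme above; at the time I was actually using bert4keras, but I use paddlenlp's tokenizer here for consistency):

from paddlenlp.transformers import ErnieTokenizer

tokenizer = ErnieTokenizer.from_pretrained("ernie-1.0")
question = "触发词地震，这个地震事件发生的地点在哪里？"  # trigger word X, where did this X event happen?
text = "昨日四川某地发生地震……"
# Text-pair encoding produces: [CLS] question [SEP] text [SEP]
inputs = tokenizer(question, text_pair=text)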

Seeing this structure, doesn't it look a lot like a prompt? This is why I have always believed that the so-called prompt existed well before last summer; it just hadn't been given that name, and the prompt is not as magical as it is made out to be.

Since quite some time has passed, I only vaguely remember the work from back then; a review follows:

  • (1) Keras version
    Since the transformers module was not yet widely used at the time, I was still using bert4keras, so I reimplemented HAE as a keras version;
  • (2) Validity verification
    I verified the validity of HAE on the DuEE dataset; as I recall, argument F1 for event extraction improved by three or four points;
  • (3) Question-form verification
    The first half of the input text, i.e. the question part, does not actually need to be a question, although a question form may better match what the model saw during pre-training. At the time I tried many question phrasings, and also tried feeding only the trigger word and event type as the prompt, and found it had little effect on model performance. My guess is that during finetuning the model captures some syntactic-structure information: essentially, it learns the correspondence between the trigger word in the prompt and the real trigger word in the original text, and then captures the features of trigger-related arguments in the original text (no matter which argument is being extracted, the trigger-word positions in the history-emb are always 1), so the exact form of the question is not critical;
  • (4) Zero-shot verification
    After the DuEE verification, I tested the model's zero-shot ability, trying to let it discover event types and argument types not in the schema, and it worked. For example, the training set contains no "snow disaster" event type, but given a corresponding question and custom argument types, the correct arguments could be extracted. I tried several other event and argument types in the same way. In summary, if the event type or argument type is close to ones in the training schema (for example, roles like 时间/time or 等级/severity level, or ...者/被...者-style person roles), it can be extracted fairly smoothly; but if it is far from the existing event and argument types in the schema, extraction becomes difficult;
  • (5) Relation extraction verification
    After verifying the effectiveness on event extraction, I thought of relation extraction. If prompts built from trigger words and argument types can extract events, could prompts built from subjects and relation types extract relations? I verified this on the DuIE dataset, and the answer is yes: the model can extract relations via prompts, though the results were not as good as dedicated relation extraction models such as CasRel at the time.
  • (6) Sequence labeling vs. double pointer
    For decoding, I tried both sequence labeling and double-pointer structures at the time. In terms of results there was not much difference: sequence labeling can be followed by a CRF for a slight improvement, while the extreme sparsity of the double-pointer targets makes training slow. Also, the datasets I used then had no overlapping spans, so I ultimately settled on sequence labeling for decoding (a sketch contrasting the two decoding schemes follows this list).
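As a counterpart to the double-pointer decoding sketched in section 1, here is a minimal BIO sequence-labeling decode (again my own illustrative code). Since each token carries exactly one tag, spans can never overlap, which is the limitation mentioned above:

def decode_bio(tags):
    # tags: one label per token, e.g. ["B", "I", "O", "B", "I"].
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i - 1))
            start = i
        elif tag == "O":
            if start is not None:
                spans.append((start, i - 1))
                start = None
    if start is not None:
        spans.append((start, len(tags) - 1))
    return spans

print(decode_bio(["B", "I", "O", "B", "I"]))  # [(0, 1), (3, 4)]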

Looking at all this, isn't it essentially the idea behind the paddle version of UIE? A two-stage scheme: first construct a prompt, then use the prompt to capture the relations between entities.

I remember being quite surprised when I saw how well this model worked at the time. But the bert-base model I was using could only handle sequences of length 512, and the question consumed part of that budget; moreover, in the Chinese setting each character of bert-base occupies one token, which made the model impractical in real application scenarios. On top of that, several consecutive bert-base encoders in a pipeline are quite heavy. So I did not continue down this road. But now there are all sorts of distilled small models, as well as Longformer and other pre-trained models for long sequences, so these problems are no longer problems. Look at the models UIE has released today, from base down to small, micro, and nano; the models can even run on mobile devices.

In short, I did not follow this line of thought at the time, which I now regret; I was just entering the industry, had little experience, and did not have much confidence in my own intuition.

3. Structured sequence generation

Turning to the authentic UIE itself: when I first saw the UIE paper, I immediately thought of another paper. I vaguely remembered it as a text-to-event generation task, where the output is a structured sequence that is then decoded into events.

So I read the paper with a somewhat dismissive mindset: how could something that had already been done get published again? This was an idea I had already encountered last year; I vaguely remembered experimenting with that earlier paper, getting mediocre results, and not paying attention to it afterwards. I remembered it as a NAACL or EMNLP paper from last year, but could not find it anywhere. Only yesterday did I discover, embarrassingly, that the article was also at ACL, and its authors are the authors of UIE...

TEXT2EVENT: https://arxiv.org/pdf/2106.09232.pdf
UIE: https://arxiv.org/pdf/2203.12277.pdf

In addition, a lot of work in information extraction over the past two years has been related to prompt-based generation. Taking event extraction as an example, a representative piece is GEN-ARG from Heng Ji's group. This shows the whole field moving in this direction, so the proposal of UIE looks like something of an inevitability amid today's prompt craze.

At this point we can see that UIE was not proposed in one stroke this year, but is the result of gradual accumulation. This series of research follows a continuous line of thought that matches how most of us would naturally reason.

4. Is UIE a "big unification"?

Both yes and no, I think. Perhaps attaching such a claim helps promote the model faster and gets more people to accept it.

First, I think relation extraction and event extraction are not essentially different; the difference lies in the special status of trigger words as entities. If the two things are themselves consistent, then unifying them is not much of a "grand" unification. Speaking of "unification", I find the idea of UIE less unified than OneIE. In OneIE, all entities and trigger words are treated as nodes, and relations between entities as well as event arguments are treated as edges between nodes, which introduces a graph structure; it is precisely this graph structure that gives OneIE its strong expressive power (though the complexity of graph decoding also gives OneIE long inference times). To this day OneIE remains strong in the field of information extraction and is still used as a SOTA baseline in many papers.

At the same time, I believe that if a truly "unified" model ever appears in deep learning, it will be a graph model. Just as we should not view relation extraction and event extraction separately, CNN and RNN, or NLP and CV, should not be viewed in opposition either. Even the transformer is essentially a graph model; what differs is how the graph is connected: an RNN connects nodes sequentially, while a CNN connects neighborhoods on a plane. Although my study of graph models is not yet deep, my intuition makes me want to keep believing this.

Still, from the perspective of task form, we cannot deny that the prompt+text structure unifies the format of the task, allowing tasks we originally considered distinct to be trained together.

But then again, why must we pursue a so-called unified model? Even something as strong as the transformer has not completely replaced CNN and RNN structures in every task; isn't that determined by differences in the tasks themselves? We should seek commonality, but not seek it for its own sake while ignoring objective differences. A unified model may win more attention and applause, but academic research must ultimately land in applications and should stay pragmatic.

Even though I don't find UIE especially innovative in idea or structure, I still think it is a great piece of work, especially the extensive pre-training experiments, which are very valuable. Moreover, UIE gives the NLP community a pragmatic, easy-to-use, and fast tool, which seems to be the long-standing pursuit of paddleNLP (in this respect, it must be said that paddle's taskflow is more practical than transformers' pipeline).

The above are some thoughts prompted by the recent UIE model. If you think something I said is wrong, please point it out, and if you have other views, you are welcome to discuss in the comments.

Personally I prefer writing application-oriented posts, but the discoveries this time were interesting enough that I couldn't help recording them. If you like my content, remember to like and follow.
