Those Interesting Things in the Advertising Industry, Part 6: Deploying BERT Online with ALBERT, from Theory to Practice (GitHub project attached)

Abstract: BERT is a milestone in NLP because it combines strong results with broad applicability. In our actual project, BERT is mainly used for text classification, that is, assigning a label to a piece of text. However, the original pre-trained BERT model easily reaches several hundred megabytes or even gigabytes, and training is very slow, which makes it unfriendly for deploying BERT online. This article studies ALBERT, currently the hottest BERT derivative, as the way to serve BERT online. ALBERT uses parameter-reduction techniques to cut memory consumption and speed up BERT training while still ranking near the top of the major benchmarks: it runs fast and it runs well. I hope it is a small help to readers who need to put BERT into production.


Contents
01 Project Background
02 From BERT to ALBERT
03 The first step of a long march: get the model running
04 Practice: the multi-class classification task
Summary




01 Project Background

The original pre-trained BERT model easily reaches several hundred megabytes or even gigabytes, and training is very slow, which makes it unfriendly for online deployment. Putting BERT online really comes down to training a good model quickly, and after some research the ALBERT project, currently the hottest BERT derivative, can solve the problems above.

ALBERT was proposed in the paper "ALBERT: A Lite BERT For Self-Supervised Learning of Language Representations". Increasing the size of a pre-trained model normally improves its performance on downstream tasks, but because of "GPU/TPU memory limitations, longer training times, and unexpected model degradation", the authors proposed the ALBERT model.

Paper Download:


Put simply, ALBERT is a lightweight BERT with far fewer parameters. It is the newest BERT derivative; although it is lightweight, its accuracy is not discounted, and it ranks near the top of the major benchmarks.

02 From BERT to ALBERT

1. The background behind ALBERT

Since deep learning took off in computer vision, one of the simplest and most effective ways to improve model performance has been to increase network depth. Taking the image classification task in the figure below as an example, as the number of network layers keeps increasing, model performance improves substantially:

Figure 1: model performance improves as the number of network layers increases


The same thing happens with BERT: as the network becomes deeper and wider, the model's performance improves:

Figure 2: BERT's performance improves as the network becomes deeper and wider


But making the network deeper and wider brings an obvious problem: parameter explosion. Here is how BERT's parameter count grows across model scales, its road to getting "fat":

Figure 3: BERT parameter explosion


How to keep BERT from being so "fat" while keeping its accuracy is one of the hot topics in academic research and a priority for putting BERT online. This is exactly what ALBERT sets out to do.

2. Where is BERT "fat"?

To make BERT thinner, we first need to know where the "meat" is. BERT uses the Transformer as its feature extractor, and that is where BERT's parameters come from. The earlier article "Those Interesting Things in the Advertising Industry, Part 4: the Transformer, from supporting role to center stage" analyzed the Transformer in depth; interested readers can look back at it.

The Transformer's parameters come mainly from two blocks: the first is the token embedding mapping module, which accounts for about 20% of the parameters; the second is the attention layers plus the feed-forward layers (FFN), which account for about 80%.

Figure 4: Transformer structure and the sources of BERT's parameters

3. ALBERT optimization strategies

Strategy 1: factorized embedding parameterization

BERT maps the one-hot word vectors directly into a high-dimensional space of size H, which costs O(V×H) parameters. ALBERT instead factorizes the embedding: the one-hot word vector is first mapped into a low-dimensional space of size E and then projected up into the high-dimensional space of size H, so the parameter cost is only O(V×E + E×H). When E << H the parameter count drops a lot. This reduces, to a certain extent, the token embedding parameters described in the first block above.
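To get a feel for the savings, here is a minimal Python sketch that compares the two parameter counts. The concrete values V = 30,000, H = 768, E = 128 are only illustrative (roughly the scale used in the paper); plug in your own vocabulary and hidden sizes:

```python
# Embedding parameter count before and after ALBERT's factorization.
# V = vocabulary size, H = hidden size, E = low-dimensional embedding size.
V, H, E = 30_000, 768, 128   # illustrative values only

bert_embedding_params = V * H            # direct V x H lookup table: O(V*H)
albert_embedding_params = V * E + E * H  # factorized: O(V*E + E*H)

print(f"V x H         = {bert_embedding_params:,}")    # 23,040,000
print(f"V x E + E x H = {albert_embedding_params:,}")  # 3,938,304
print(f"reduction     = {1 - albert_embedding_params / bert_embedding_params:.1%}")
```

With E much smaller than H, the embedding table shrinks by more than 80% in this example, which is where the token-embedding savings described above come from.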

The reason this factorization can reduce the parameter count is that the token embedding is context-independent: it simply converts a sparse one-hot vector into a dense vector. The attention and FFN blocks in the second part, by contrast, are context-dependent hidden layers and carry more information. So it is feasible to use an intermediate dimension E smaller than H: the one-hot word vector first passes through a low-dimensional embedding matrix and is then mapped back up by a high-dimensional projection matrix. The red box in the figure below marks the factorized part:

Figure 5: factorization reduces the embedding parameter count

Looking at the effect of factorizing the token embedding: the model's parameters drop by 17% overall, while accuracy drops by less than 1%.

Figure 6: effect of factorizing the embedding parameters


Strategy 2: cross-layer parameter sharing

A layer-by-layer analysis of the Transformer's parameters shows that the parameters of each layer look visually similar, with attention concentrated on the [CLS] token and along the diagonal, so a cross-layer parameter-sharing scheme can be used.

Broadly speaking, cross-layer parameter sharing for the Transformer encoder has two options: one is to share the parameters of the attention module, the other is to share the parameters of the feed-forward (FFN) layers. The specific results are shown below:

Figure 7: effect of parameter sharing on parameter count and model accuracy

Comparing no sharing with sharing the FFN-layer parameters when the low-dimensional embedding size is E = 768, the parameter count drops by nearly 50%, but this sharing is also the main cause of the drop in model accuracy. Sharing the attention-layer parameters has much less impact on the results.
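To make the sharing idea concrete, here is a minimal PyTorch-style sketch (an illustration only, not the actual TensorFlow code in albert_zh): a single encoder layer is allocated once and then applied at every depth, so the parameter count no longer grows with the number of layers:

```python
import torch.nn as nn

class SharedEncoder(nn.Module):
    """Minimal sketch of ALBERT-style cross-layer parameter sharing."""

    def __init__(self, hidden_size=768, num_heads=12, num_layers=12):
        super().__init__()
        # Only ONE layer's worth of attention + FFN parameters is created here.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size,
            nhead=num_heads,
            dim_feedforward=4 * hidden_size,
        )
        self.num_layers = num_layers

    def forward(self, x):
        # The same layer (same weights) is applied num_layers times,
        # so depth increases while the parameter count stays constant.
        for _ in range(self.num_layers):
            x = self.layer(x)
        return x
```

ALBERT's default configuration shares everything (attention and FFN), which is what this sketch mimics; the figure above shows that sharing only the attention parameters preserves more of the accuracy.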

Strategy 3: a self-supervised task for sentence coherence, Sentence Order Prediction (SOP)

By reworking the NSP (Next Sentence Prediction) task, the pre-training objective is improved so that the model learns sentence continuity better.

The earlier article "Those Interesting Things in the Advertising Industry, Part 3: a detailed look at BERT, the star of NLP" explained the BERT model and pointed out that BERT's outstanding achievements in NLP in recent years rest mainly on two innovations: the masked language model (Masked LM) and next-sentence prediction (Next Sentence Prediction). Interested readers can look back at it.

NSP itself is a binary classification task whose goal is to predict whether two sentences are consecutive. NSP actually contains two sub-tasks: topic prediction and coherence prediction. NSP takes two consecutive sentences from the same document as a positive sample and picks sentences from different documents as negative samples. Because the negatives come from different documents, the difference between them can be very large, which makes the task too easy. To strengthen the model's ability to predict sentence continuity, ALBERT proposes a new task, SOP (Sentence Order Prediction): positive samples are obtained the same way as in NSP, while negative samples are positive samples with the sentence order reversed.
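A minimal sketch of how NSP and SOP training pairs could be built from lists of consecutive sentences (illustration only; the real sampling logic lives in the pre-training data pipeline):

```python
import random

def make_nsp_pair(doc, other_doc):
    """NSP: positive = two consecutive sentences from one document,
    negative = second sentence drawn from a different document."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], 1               # consecutive -> positive
    return doc[i], random.choice(other_doc), 0     # cross-document -> negative

def make_sop_pair(doc):
    """SOP: positive = two consecutive sentences in the right order,
    negative = the same two sentences with the order swapped."""
    i = random.randrange(len(doc) - 1)
    if random.random() < 0.5:
        return doc[i], doc[i + 1], 1               # correct order -> positive
    return doc[i + 1], doc[i], 0                   # swapped order -> negative

doc = ["I bought a ticket.", "Then I boarded the train.", "The trip was long."]
other = ["Which attractions in Chengdu are the most fun?"]
print(make_nsp_pair(doc, other))
print(make_sop_pair(doc))
```

Because both sentences in an SOP pair come from the same document, the model can no longer rely on topic differences and is forced to learn sentence order and coherence.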

The comparison of SOP and NSP is shown below:

Figure 8: comparison of SOP and NSP

As the figure shows, a model trained with the NSP task cannot solve SOP-style tasks, while a model trained with SOP can solve NSP tasks. Overall, the model trained with the SOP task also outperforms the one trained with NSP.

Strategy 4: remove dropout

Dropout is mainly used to prevent overfitting, but in practice the MLM objective does not overfit easily. Removing dropout also gets rid of many temporary variables during training and therefore effectively improves the model's memory utilization.
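In practice this just means setting the dropout probabilities to zero in the model configuration. The field names below follow the usual BERT-style config file and should be checked against the albert_config.json that ships with the checkpoint you use:

```python
# Illustrative fragment of a BERT/ALBERT-style model config with dropout removed.
albert_config = {
    "hidden_size": 768,
    "num_hidden_layers": 12,
    "num_attention_heads": 12,
    "attention_probs_dropout_prob": 0.0,  # no dropout inside attention
    "hidden_dropout_prob": 0.0,           # no dropout after embeddings / FFN
}
```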

Figure 9: effect of removing dropout

Other strategies: the impact of network width and depth on model performance

1. Is a deeper network always better?
Comparing ALBERT at different depths shows that as the number of layers increases, performance on different NLP tasks improves to some extent. But this is not absolute; on some tasks the performance drops.

Figure 10: effect of network depth

2. Is a wider network always better?
Comparing 3-layer ALBERT-large models of different widths gives a result similar to depth: as the network gets wider, performance on different NLP tasks improves to some extent, but on some tasks it drops.

Figure 11: effect of network width

Overall, the essence of ALBERT is to use parameter-reduction techniques to cut memory consumption and ultimately speed up BERT training. The main optimizations are the following:

  • Factorized embeddings and cross-layer parameter sharing reduce the number of model parameters and improve parameter efficiency;
  • Replacing NSP with SOP strengthens the model's ability to learn sentence continuity and improves the self-supervised learning task;
  • Removing dropout saves a large number of temporary variables, effectively improving memory utilization during training, raising the model's efficiency, and reducing the amount of training data needed.

03 The first step of a long march: get the model running

Because the actual project mainly deals with Chinese text, I used ALBERT_zh, the Chinese version of ALBERT. GitHub project address: .

I remember a picture I saw some time ago that describes my feelings at this moment very well:

Figure 12: step one, get the model running

For a pragmatist like me, the first step with any model is always to get it running end to end; optimization can wait. Getting it running not only builds confidence, but its most practical benefit is that we can bring the project online quickly. Because I needed to complete a text classification task, I downloaded the project from the GitHub address above, jumped into the albert_zh directory on the cluster, and executed sh run_classifier_lcqmc.sh to get it running. The project does not ship a single-sentence classification task, only a sentence-pair similarity task, so I first got that task running and then modified the code for my own task.

The run_classifier_lcqmc.sh script is roughly divided into two blocks. The first block, shown below, is the preparation work before the model runs: obtaining the data and the pre-trained model and configuring the model parameters and related settings.

Figure 13: preparation before running the model

The second block is responsible for actually running the model: mainly the python command that launches the program, together with the parameters that need to be configured.

Figure 14: running the model


To summarize this section: it explained how to successfully run the sentence-pair relation task that ALBERT_zh itself provides. This demo task is very similar to the Chinese text classification task in our actual project, and below we complete the actual project by modifying this script and its code.

04 Practice: the multi-class classification task

The modified project is on GitHub at the following address: .

I forked the original project and added two files, run_classifier_multiclass.sh and run_classifier_multiclass.py, which are the script and code for running the text classification task. The principle of the modification is fairly simple; here is a rough outline.

The sentence-pair relation task originally provided by the project uses the data format id, text_a, text_b, label; the task is to decide whether the two sentences are related. A positive sample looks like this:

text_a: Is the Legend game endorsed by Jackie Chan fun?
text_b: Has Jackie Chan endorsed other Legend games?
label: 1

A negative sample might look like this:
text_a: Is the Legend game endorsed by Jackie Chan fun?
text_b: Which attractions in Chengdu are the most fun?
label: 0

From the positive and negative samples above you should understand what a sentence-pair relation task is: it is really a supervised binary classification task. Our actual project uses BERT for text classification, that is, identifying which label a piece of text belongs to, which corresponds to the task above with only text_a and label. Because the two tasks are of the same type, the main modification is to remove the parts of the code that handle text_b. The concrete script and code changes are in the two files mentioned above; readers who need them can help themselves. One thing to note: the original project's data files are in tsv format, while mine are in csv format, so the data reading differs slightly; nothing else in the model was touched.
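As an illustration of the data-reading change (a sketch only, not a copy of run_classifier_multiclass.py), loading a csv file that has just text_a and label columns could look like this, assuming the file has a header row; the file name train.csv is hypothetical:

```python
import csv

def read_examples(csv_path):
    """Read a single-sentence classification file with columns: text_a, label.

    Illustrative sketch; the real logic lives in the DataProcessor used by
    run_classifier_multiclass.py.
    """
    examples = []
    with open(csv_path, encoding="utf-8") as f:
        reader = csv.DictReader(f)   # csv here instead of the original tsv
        for row in reader:
            examples.append((row["text_a"], row["label"]))   # no text_b field
    return examples

if __name__ == "__main__":
    for text_a, label in read_examples("train.csv")[:3]:
        print(label, text_a)
```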


Summary

The actual project requires deploying BERT online, which means training a better model faster, so after some research we used ALBERT, the newest BERT derivative. ALBERT reduces the number of model parameters through factorized embeddings and cross-layer parameter sharing, improving parameter efficiency; it replaces NSP with SOP to strengthen the model's ability to learn sentence continuity and improve the self-supervised task; and it removes dropout, which saves a large number of temporary variables, effectively improves memory utilization during training, and raises the model's efficiency. Finally, the project's sentence-pair relation task was converted into the text classification task our real business needs. There is theory, which helps readers understand why ALBERT trains faster while still performing well, and there is practice: if you need ALBERT for text classification, you can run my modified script and code directly.

If you like this kind of article, you can follow my WeChat official account, 数据拾光者. Original content is published there first and then synchronized to Zhihu, Toutiao, Jianshu, and CSDN. Comments and exchanges are welcome; if you have questions, you can always message me through the account.

