ELMo Explained (Paper + PyTorch Source)

ELMo came out quite early, around the beginning of 2018. I only learned about it in hindsight, quite a while after BERT appeared. Over the past couple of days I carefully read the paper and the source code, and I'm recording some notes here. If anything isn't covered in enough detail, feedback is welcome~

Table of Contents
Preface
I. How ELMo works
1. Overall model structure
2. Character encoding layer
3. How biLMs work
4. Generating the ELMo word vectors
5. Combining with downstream NLP tasks
II. PyTorch implementation
1. Character encoding layer
2. The biLMs layer
3. Generating the ELMo word vectors
III. Experiments
IV. Some analysis
1. Which layers' outputs to use?
2. Where to add ELMo?
3. What does each layer's output focus on?
4. Efficiency analysis
V. Summary
Links
Preface
ELMo comes from the paper "Deep contextualized word representations", published by the Allen Institute at NAACL 2018. As the title suggests, it proposes a new method of word representation. In the authors' own words: ELMo is a deep contextualized word representation model that models both (1) the complex characteristics of word use (e.g., syntax and semantics) and (2) how these uses vary across contexts (e.g., polysemy). The word vectors are derived from the hidden states of a deep bidirectional language model (biLM) pretrained on a large corpus. They can be flexibly and easily added to existing models and significantly improve performance on many NLP tasks, such as question answering, textual entailment, and sentiment analysis. This sounds very exciting, and the idea behind it is quite reasonable! Below I dissect the paper and its PyTorch source code; see the Links section at the end for the relevant resources.

One note up front: in my view, the name "ELMo" can refer either to the model that produces the word vectors or to the resulting word vectors themselves, just like Word2Vec and GloVe, which also carry both meanings. Below, when "model" appears near ELMo, it refers to the model trained to produce the vectors; when "word vector" appears, it refers to the vectors it produces.

I. How ELMo works
The most commonly used word embedding methods used to be Word2Vec and GloVe, but these embeddings are trained in a context-independent way: a word gets the same vector no matter what context it appears in, which is very unfriendly to ambiguous words. The paper therefore computes each word's representation using the input sentence as its context, and proposes ELMo (Embeddings from Language Models). The basic idea, put plainly, is to follow the usual recipe of training a language model and then extract the outputs of the language model's intermediate hidden layers as the representation of a word in its current context. Simple, but very useful!

1. Overall model structure
The paper does not actually give a concrete diagram of the ELMo architecture (painful for someone with as little imagination as me). Piecing together the clues in the paper and the PyTorch source code, I ended up with roughly the following picture (excuse the clumsy drawing):


Suppose the input sentence has dimensions B * W * C, where B is batch_size, W is num_words (the number of words in a sentence; padding may be needed within a batch), and C is max_characters_per_token (the number of characters per word, which the paper fixes at 50 rather than setting dynamically per batch). D denotes projection_dim, the embedding size of the words fed into the biLMs, or equivalently 1/2 of the dimensionality of the final ELMo word vectors.

As the figure shows, the input sentence passes through:

Char Encode Layer: first, a character encoding layer. ELMo actually works at the char level, so it first encodes all the characters of each word to obtain that word's representation. The output of this layer has dimension B * W * D, i.e., a word-level encoding of the sentence.
biLMs: the sentence then passes through the biLMs, i.e., the bidirectional language model. Internally it actually trains two separate language models, one forward and one backward, and concatenates their representations. The resulting output has dimension (L + 1) * B * W * 2D; the +1 accounts for the initial embedding layer, a bit like a residual connection, as detailed in the "How biLMs work" section below.
Scalar Mixer: finally, the representations from the individual biLMs layers go through a mixing layer that combines them linearly (described in detail in the "Generating the ELMo word vectors" section below), yielding the final ELMo vector of dimension B * W * 2D.
This is only a bird's-eye view of the overall structure; each module may still look rather opaque. No matter, let's analyze them one by one:

2. Character encoding layer
This is the "Char Encode Layer": its input has dimension B * W * C and its output has dimension B * W * D. Judging from the source code, its structure looks like this:


The drawing is a bit messy; please bear with it~

First, the input sentence is reshaped to BW * C, since all the chars are processed together. It then goes through the following layers:

Char Embedding: an ordinary embedding layer that encodes each char. The char vocabulary has about 262 entries in total: codes 0-255 are the character codes themselves, and 256-261 are six special symbols: <bow> (begin of word), <eow> (end of word), <bos> (begin of sentence), <eos> (end of sentence), <pow> (word padding) and <pos> (sentence padding). The vocabulary is therefore quite small, and there is no OOV problem. The embedding matrix here has shape 262 (num_characters) * d (char_embed_dim). Note that d here and D from the previous section are different things: d is the character embedding dimension, while D is the word embedding dimension; the mapping between them appears below. The output of this part has dimension BW * C * d.
Multi-scale convolutional layers: convolutional layers of different scales, extended in width rather than depth; they all take the same input but differ in kernel_size and channel_size, so as to capture information about different n-grams. This is essentially modeled on TextCNN. Suppose there are m such convolutions, with kernel sizes k1, k2, ..., km (for example 1, 2, 3, 4, 5, 6, 7) and channel sizes d1, d2, ..., dm (for example 32, 64, 128, 256, 512, 1024). Note that these are 1-D convolutions, i.e., they convolve only along the sequence (character) dimension. As in image processing, each convolution is followed by max pooling; the main reason is that the different kernel sizes give outputs of different lengths that could not otherwise be merged, so max pooling over the sequence dimension effectively takes the strongest char response as the representation of the whole word. Finally an activation is applied. Because of the different channel sizes, the outputs of this step have dimensions BW * d1, BW * d2, ..., BW * dm.
Concat layer: the previous step yields m matrices of different widths, so to simplify later processing they are concatenated along the last dimension and then reshaped back to the word level, giving dimension B * W * (d1 + d2 + ... + dm).
Highway layer: Highway networks (see https://arxiv.org/abs/1505.00387) borrow the residual idea from vision and are often used in NLP. Looking at the implementation (formula below), a highway layer is essentially a fully connected layer plus a residual-style connection, except that a gate g performs an element-wise mix of x and f(A(x)). There are H highway layers here, and the output keeps dimension B * W * (d1 + d2 + ... + dm).
y = g * x + (1 - g) * f(A(x)),  g = sigmoid(B(x))

Linear mapping layer: after the previous steps, the resulting vector of dimension d1 + d2 + ... + dm is often quite long, so an extra Linear layer maps it down to dimension D, which serves as the word embedding fed into the subsequent layers. The output here has dimension B * W * D. (A compact end-to-end sketch of this whole encoder follows the list.)
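To tie these steps together, here is a minimal sketch of the character encoder described above. This is my own simplified re-implementation for illustration, not the allennlp code: the filter configuration, the ReLU activation, and the highway count are example choices.

import torch
import torch.nn as nn

class CharEncoderSketch(nn.Module):
    """Sketch: char embedding -> multi-scale 1-D conv + max-pool -> concat -> highway -> projection."""
    def __init__(self, num_chars=262, char_dim=16,
                 filters=((1, 32), (2, 64), (3, 128)),
                 num_highway=2, projection_dim=512):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, num, kernel_size=width) for width, num in filters])
        n_filters = sum(num for _, num in filters)
        self.highways = nn.ModuleList(
            [nn.Linear(n_filters, 2 * n_filters) for _ in range(num_highway)])
        self.projection = nn.Linear(n_filters, projection_dim)

    def forward(self, char_ids):                          # char_ids: (B, W, C)
        B, W, C = char_ids.size()
        x = self.char_emb(char_ids.reshape(B * W, C))     # (BW, C, d)
        x = x.transpose(1, 2)                             # (BW, d, C), channels-second for Conv1d
        pooled = []
        for conv in self.convs:
            y = torch.relu(conv(x))                       # (BW, d_i, C - k_i + 1)
            pooled.append(y.max(dim=-1).values)           # max over char positions -> (BW, d_i)
        x = torch.cat(pooled, dim=-1)                     # (BW, d1 + ... + dm)
        for layer in self.highways:                       # y = g*x + (1-g)*f(A(x)), g = sigmoid(B(x))
            nonlinear, gate = layer(x).chunk(2, dim=-1)
            gate = torch.sigmoid(gate)
            x = gate * x + (1 - gate) * torch.relu(nonlinear)
        return self.projection(x).view(B, W, -1)          # (B, W, D)

# Example: CharEncoderSketch()(torch.randint(0, 262, (2, 10, 50))) has shape (2, 10, 512).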
3. How biLMs work
ELMo is built mainly on biLMs (bidirectional language models), so let's start with the mathematics of what a biLM is.

Specifically, given a sequence of N tokens (t_1, t_2, ..., t_N), a forward language model (typically a multi-layer LSTM or similar) models the probability of the current token given the preceding tokens:

p(t_1, t_2, ..., t_N) = \prod_{k=1}^{N} p(t_k \mid t_1, t_2, ..., t_{k-1})

At each position k, every layer of the model produces a context-dependent representation \overrightarrow{h}_{k,j}^{LM}, where j = 1, ..., L indexes the layer. The top-layer output \overrightarrow{h}_{k,L}^{LM} is used to predict the next token t_{k+1}.

Similarly, the backward language model is trained like the forward one, except that it consumes the sequence in reverse, i.e., it models the probability of the current token given the tokens that follow it:

p(t_1, t_2, ..., t_N) = \prod_{k=1}^{N} p(t_k \mid t_{k+1}, t_{k+2}, ..., t_N)

Likewise, at each position k, every layer of the backward LM produces a context-dependent representation \overleftarrow{h}_{k,j}^{LM}.
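As a quick concrete example of the two factorizations, for a three-token sequence they read:

p(t_1, t_2, t_3) = p(t_1) \, p(t_2 \mid t_1) \, p(t_3 \mid t_1, t_2)    (forward)
p(t_1, t_2, t_3) = p(t_3) \, p(t_2 \mid t_3) \, p(t_1 \mid t_2, t_3)    (backward)

so the forward LM predicts each token from its left context and the backward LM predicts it from its right context.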

The biLM used by ELMo combines the forward and backward language models; the training objective is to jointly maximize the log-likelihood of both directions:

\sum_{k=1}^{N} \left( \log p(t_k \mid t_1, ..., t_{k-1}; \Theta_x, \overrightarrow{\Theta}_{LSTM}, \Theta_s) + \log p(t_k \mid t_{k+1}, ..., t_N; \Theta_x, \overleftarrow{\Theta}_{LSTM}, \Theta_s) \right)

Here \Theta_x and \Theta_s are the parameters of the token embedding and of the output layer (before the Softmax), shared between the two directions, while \overrightarrow{\Theta}_{LSTM} and \overleftarrow{\Theta}_{LSTM} are the parameters of the forward and backward LSTMs.

As you can see, this essentially amounts to training a forward LM and a backward LM separately. It seems they can only be trained separately, because a language model cannot be trained in both directions at once (see the sketch of the joint objective below).
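Here is a minimal sketch of that joint objective. It is my own toy illustration rather than the allennlp code: single-layer LSTMs without the projection, with a shared token embedding (Theta_x) and a shared softmax layer (Theta_s) as in the formula above; the real ELMo uses the char encoder and projected multi-layer LSTMs.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBiLM(nn.Module):
    """Two independent one-directional LMs sharing the token embedding and the softmax layer."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)           # Theta_x (shared)
        self.fwd = nn.LSTM(dim, dim, batch_first=True)     # forward Theta_LSTM
        self.bwd = nn.LSTM(dim, dim, batch_first=True)     # backward Theta_LSTM
        self.out = nn.Linear(dim, vocab_size)              # Theta_s (shared)

    def loss(self, tokens):                                # tokens: (B, N) integer ids
        x = self.emb(tokens)
        # Forward LM: predict t_k from t_1 .. t_{k-1}
        h_fwd, _ = self.fwd(x[:, :-1])                     # inputs t_1 .. t_{N-1}
        fwd_loss = F.cross_entropy(self.out(h_fwd).flatten(0, 1),
                                   tokens[:, 1:].flatten())
        # Backward LM: predict t_k from t_{k+1} .. t_N (run the LSTM over the reversed sequence)
        x_rev = x.flip(dims=[1])
        h_bwd, _ = self.bwd(x_rev[:, :-1])                 # inputs t_N .. t_2
        bwd_loss = F.cross_entropy(self.out(h_bwd).flatten(0, 1),
                                   tokens.flip(dims=[1])[:, 1:].flatten())
        # Maximizing the summed log-likelihood == minimizing the summed cross-entropy.
        return fwd_loss + bwd_loss

# Example: ToyBiLM().loss(torch.randint(0, 1000, (4, 12)))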

Schematically, then, the multi-layer BiLSTM looks like this:

Here h denotes the LSTM unit's hidden_size, which can be quite large, for example D = 512 and h = 4096. Each layer is therefore followed by a Linear layer that projects the dimension back from h to D before feeding the next layer. The final output stacks the embedding-layer output together with the output of every LSTM layer, where each layer's output is the concatenation of the forward and backward outputs at each timestep, giving a final dimension of (L + 1) * B * W * 2D. The +1 in L + 1 is the embedding-layer output, which is duplicated (concatenated with itself) so that its dimension matches the 2D of the other biLMs layers (a small sketch of this assembly follows).
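A tiny sketch of how that output tensor is assembled, on toy tensors only; word_emb stands in for the char-encoder output and the fwd/bwd lists for the projected per-layer forward/backward LSTM outputs:

import torch

B, W, D, L = 2, 10, 512, 2
word_emb = torch.randn(B, W, D)                       # char-encoder output
fwd = [torch.randn(B, W, D) for _ in range(L)]        # forward LSTM output of each layer
bwd = [torch.randn(B, W, D) for _ in range(L)]        # backward LSTM output of each layer

# Layer 0: the embedding output, duplicated so its last dim matches 2D.
layers = [torch.cat([word_emb, word_emb], dim=-1)]
# Layers 1..L: concatenate forward and backward outputs at each timestep.
layers += [torch.cat([f, b], dim=-1) for f, b in zip(fwd, bwd)]

bilm_output = torch.stack(layers, dim=0)              # (L + 1, B, W, 2D)
assert bilm_output.shape == (L + 1, B, W, 2 * D)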

4. Generating the ELMo word vectors
After the biLMs layer we obtain representations of dimension (L + 1) * B * W * 2D; now we need to generate the final ELMo vectors!

For each token t_k, an L-layer biLM produces 2L + 1 representations in total, as in the following equation:

R_k = \{x_k^{LM}, \overrightarrow{h}_{k,j}^{LM}, \overleftarrow{h}_{k,j}^{LM} \mid j = 1, ..., L\} = \{h_{k,j}^{LM} \mid j = 0, ..., L\}

Here h_{k,0}^{LM} is the word-embedding output, and h_{k,j}^{LM} = [\overrightarrow{h}_{k,j}^{LM}; \overleftarrow{h}_{k,j}^{LM}] is the concatenation of the forward and backward outputs of layer j.

The paper then mixes these representations with a scalar mixer, using the following formula:

ELMo_k^{task} = E(R_k; \Theta^{task}) = \gamma^{task} \sum_{j=0}^{L} s_j^{task} h_{k,j}^{LM}

Here the s_j^{task} are softmax-normalized weights and the scalar \gamma^{task} rescales the whole ELMo vector. Both are learned as parameters, and they take different values for different tasks.



The paper also mentions that the output distributions of the different layers can differ quite a bit, so before the linear combination it sometimes helps to apply Layer Normalization to each layer's output, in line with what the Transformer does.

After the Scalar Mixer the vector has dimension B * W * 2D; this is the generated ELMo word vector, which can now be used for downstream tasks.

5. Combining with downstream NLP tasks
The ELMo model is usually pretrained on a large corpus. Since it is trained as a language model, it needs no labels, only plain text, so very large corpora can be used, which is a clear advantage. Once training is finished, you can feed the model a new sentence and obtain an ELMo word vector for each word, conditioned on the context that sentence provides.

The paper also mentions that using appropriate dropout and L2 regularization on the ELMo weights during training improves results.

The word vectors can then be plugged into a downstream NLP task, such as question answering or sentiment analysis. In terms of where they are plugged in, they can be concatenated with the downstream model's own input embeddings, or concatenated with its output. In terms of whether the model is frozen, you can extract all the ELMo word vectors in advance, i.e., keep the ELMo model fixed without training it, or you can also fine-tune the ELMo model while training the downstream NLP task. In short, it is very flexible: you can insert it wherever you like as an add-on (a usage sketch follows below).
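For example, with the pretrained weights released through allennlp, extracting ELMo vectors looked roughly like this at the time of writing; the URLs and the exact Elmo/batch_to_ids API are allennlp 0.x conventions, so treat them as assumptions and check the current documentation:

import torch
from allennlp.modules.elmo import Elmo, batch_to_ids

# Pretrained option/weight files published by AllenNLP (paths may have moved since).
options_file = "https://allennlp.s3.amazonaws.com/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_options.json"
weight_file = "https://allennlp.s3.amazonaws.com/elmo/2x4096_512_2048cnn_2xhighway/elmo_2x4096_512_2048cnn_2xhighway_weights.hdf5"

# num_output_representations=1: one scalar-mixed ELMo vector per token.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["I", "ate", "an", "apple"], ["Apple", "released", "a", "phone"]]
character_ids = batch_to_ids(sentences)                          # (B, W, 50) char ids
elmo_vectors = elmo(character_ids)["elmo_representations"][0]    # (B, W, 1024)

# Downstream, these can be concatenated with the task model's own embeddings, e.g.
# inputs = torch.cat([task_embeddings, elmo_vectors], dim=-1)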

II. PyTorch implementation
This section mainly follows the ELMo-related parts of allennlp, covering the biLMs model implementation and the ELMo inference part; only the core pieces are listed, not every detail. How to combine it with a downstream NLP task and fine-tune it is left for the reader to explore and is not covered here!

1. Character encoding layer
This implements the Char Encode Layer described above.

First, the multi-scale CNN:

# Multi-scale CNN

# Network definition (inside the encoder's __init__): one Conv1d per (kernel width, num filters) pair
for i, (width, num) in enumerate(filters):
    conv = torch.nn.Conv1d(
        in_channels=char_embed_dim,
        out_channels=num,
        kernel_size=width,
        bias=True
    )
    self.add_module('char_conv_{}'.format(i), conv)

# forward method
def forward(self, character_embedding):
    convs = []
    for i in range(len(self._convolutions)):
        conv = getattr(self, 'char_conv_{}'.format(i))
        convolved = conv(character_embedding)
        # (batch_size * sequence_length, n_filters for this width)
        convolved, _ = torch.max(convolved, dim=-1)
        convolved = activation(convolved)    # configured activation, e.g. relu or tanh
        convs.append(convolved)
    # (batch_size * sequence_length, n_filters)
    token_embedding = torch.cat(convs, dim=-1)
    return token_embedding
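One detail when reusing this snippet: torch.nn.Conv1d convolves over the last axis, so character_embedding must be laid out as (batch_size * sequence_length, char_embed_dim, max_chars), i.e. transposed from the embedding layer's natural (batch_size * sequence_length, max_chars, char_embed_dim) output. A standalone toy call with made-up sizes to confirm the shapes:

import torch

bw, d, c = 20, 16, 50                                    # batch*words, char_embed_dim, max_chars (toy sizes)
conv = torch.nn.Conv1d(in_channels=d, out_channels=32, kernel_size=3, bias=True)

character_embedding = torch.randn(bw, d, c)              # channels-second layout expected by Conv1d
convolved = conv(character_embedding)                    # (bw, 32, c - 3 + 1)
pooled, _ = torch.max(convolved, dim=-1)                 # max over char positions -> (bw, 32)
assert pooled.shape == (bw, 32)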
And here is the Highway implementation:

# Highway

# Network definition: each highway layer is one Linear producing both the transform and the gate
self._layers = torch.nn.ModuleList([torch.nn.Linear(input_dim, input_dim * 2)
                                    for _ in range(num_layers)])

# forward method
def forward(self, inputs):
    current_input = inputs
    for layer in self._layers:
        projected_input = layer(current_input)
        linear_part = current_input
        # NOTE: if you modify this, think about whether you should modify the initialization
        # above, too.
        nonlinear_part, gate = projected_input.chunk(2, dim=-1)
        nonlinear_part = self._activation(nonlinear_part)
        gate = torch.sigmoid(gate)
        # y = g * x + (1 - g) * f(A(x)), g = sigmoid(B(x))
        current_input = gate * linear_part + (1 - gate) * nonlinear_part
    return current_input
2. The biLMs layer
This part effectively trains two LSTM stacks in opposite directions whose outputs are then concatenated and projected. The core of a single direction, single layer looks like this (an abridged excerpt):

# Network definition
# input_size: dimension of the input embedding
# hidden_size: dimension of the input/output hidden state
# cell_size: internal dimension of the LSTM cell
# Generally input_size = hidden_size = D, and cell_size is h.
self.input_linearity = torch.nn.Linear(input_size, cell_size * 4, bias=False)
self.state_linearity = torch.nn.Linear(hidden_size, cell_size * 4, bias=True)
self.state_projection = torch.nn.Linear(cell_size, hidden_size, bias=False)

# forward method (abridged: the bookkeeping that prepares timestep_input, previous_state,
# previous_memory, current_length_index, etc. is omitted here)
def forward(self, inputs, batch_lengths, initial_state):
    for timestep in range(total_timesteps):

        # Do the projections for all the gates all at once.
        # Both have shape (batch_size, 4 * cell_size)
        projected_input = self.input_linearity(timestep_input)
        projected_state = self.state_linearity(previous_state)

        # Main LSTM equations using relevant chunks of the big linear
        # projections of the hidden state and inputs.
        input_gate = torch.sigmoid(projected_input[:, (0 * self.cell_size):(1 * self.cell_size)] +
                                   projected_state[:, (0 * self.cell_size):(1 * self.cell_size)])
        forget_gate = torch.sigmoid(projected_input[:, (1 * self.cell_size):(2 * self.cell_size)] +
                                    projected_state[:, (1 * self.cell_size):(2 * self.cell_size)])
        memory_init = torch.tanh(projected_input[:, (2 * self.cell_size):(3 * self.cell_size)] +
                                 projected_state[:, (2 * self.cell_size):(3 * self.cell_size)])
        output_gate = torch.sigmoid(projected_input[:, (3 * self.cell_size):(4 * self.cell_size)] +
                                    projected_state[:, (3 * self.cell_size):(4 * self.cell_size)])
        memory = input_gate * memory_init + forget_gate * previous_memory

        # shape (current_length_index, cell_size)
        pre_projection_timestep_output = output_gate * torch.tanh(memory)

        # shape (current_length_index, hidden_size): project from cell_size (h) back to hidden_size (D)
        timestep_output = self.state_projection(pre_projection_timestep_output)

        output_accumulator[0:current_length_index + 1, index] = timestep_output

    # Mimic the pytorch API by returning state in the following shape:
    # (num_layers * num_directions, batch_size, ...). As this
    # LSTM cell cannot be stacked, the first dimension here is just 1.
    final_state = (full_batch_previous_state.unsqueeze(0),
                   full_batch_previous_memory.unsqueeze(0))

    return output_accumulator, final_state
3. Generating the ELMo word vectors
This part is the Scalar Mixer; the code is as follows:

# Parameter definition
self.scalar_parameters = ParameterList(
    [Parameter(torch.FloatTensor([initial_scalar_parameters[i]]),
               requires_grad=trainable) for i
     in range(mixture_size)])
self.gamma = Parameter(torch.FloatTensor([1.0]), requires_grad=trainable)

# forward method
def forward(self, tensors, mask=None):

    def _do_layer_norm(tensor, broadcast_mask, num_elements_not_masked):
        tensor_masked = tensor * broadcast_mask
        mean = torch.sum(tensor_masked) / num_elements_not_masked
        variance = torch.sum(((tensor_masked - mean) * broadcast_mask)**2) / num_elements_not_masked
        return (tensor - mean) / torch.sqrt(variance + 1E-12)

    # Softmax over the per-layer scalar weights s_j
    normed_weights = torch.nn.functional.softmax(torch.cat([parameter for parameter
                                                            in self.scalar_parameters]), dim=0)
    normed_weights = torch.split(normed_weights, split_size_or_sections=1)

    if not self.do_layer_norm:
        pieces = []
        for weight, tensor in zip(normed_weights, tensors):
            pieces.append(weight * tensor)
        return self.gamma * sum(pieces)

    else:
        # Optionally layer-normalize each layer's output (over non-masked positions) before mixing
        mask_float = mask.float()
        broadcast_mask = mask_float.unsqueeze(-1)
        input_dim = tensors[0].size(-1)
        num_elements_not_masked = torch.sum(mask_float) * input_dim

        pieces = []
        for weight, tensor in zip(normed_weights, tensors):
            pieces.append(weight * _do_layer_norm(tensor,
                                                  broadcast_mask, num_elements_not_masked))
        return self.gamma * sum(pieces)
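A toy call of this mixer, assuming the ScalarMix module from allennlp (the import path and constructor arguments are allennlp 0.x conventions, so verify them against your version):

import torch
from allennlp.modules.scalar_mix import ScalarMix

L, B, W, D = 2, 2, 10, 512
mix = ScalarMix(mixture_size=L + 1, do_layer_norm=False, trainable=True)

# One (B, W, 2D) tensor per biLM layer: the embedding layer plus L LSTM layers.
layer_outputs = [torch.randn(B, W, 2 * D) for _ in range(L + 1)]
elmo_vectors = mix(layer_outputs)            # gamma * softmax-weighted sum, shape (B, W, 2D)
assert elmo_vectors.shape == (B, W, 2 * D)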
III. Experiments
This section shows ELMo's performance on several actual downstream tasks: SQuAD (question answering), SNLI (textual entailment), SRL (semantic role labeling), Coref (coreference resolution), NER (named entity recognition), and SST-5 (sentiment analysis). The results are as follows:

As you can see, even starting from fairly low baselines, adding ELMo pushes the results beyond the previous SoTA!

IV. Some analysis
In the paper, the authors also carry out some interesting analyses, probing ELMo's advantages and characteristics from several angles, for example:

1. Which layers' outputs to use?
This explores the effect of using different biLMs layers, and of using different L2 regularization weights:


"Last Only" here means using only the topmost biLM output, and λ is the weight of the L2 regularization. Using all layers is generally better, and a smaller λ works better, since it lets the representations of the different layers remain distinct; when λ is large, the layer weights are pushed toward each other and the contributions of the layers converge.

2. Where to add ELMo?
As mentioned earlier, the ELMo vectors can be added at the input or at the output; the authors compare the two settings:


For question answering and textual entailment, adding ELMo at both the input and the output works better, while for semantic role labeling, adding it only at the input is better. The paper speculates that the first two tasks rely on attention, so letting the attention see ELMo directly at the output is beneficial; for semantic role labeling, the task-specific contextual representations matter more than the biLMs' general-purpose output.

3. What does each layer's output focus on?
The paper finds experimentally that the lower layers of the biLMs capture more syntactic features (such as part of speech), while the higher layers capture more semantic features. For example, the following results:


The left is a word sense disambiguation task and the right is POS tagging. On word sense disambiguation, the second layer works better than the first; on POS tagging, the first layer works better than the second.

Overall, though, using the outputs of all layers is better; the specific weights can simply be left for the model to learn.

4. Efficiency analysis
In general, models that use a pretrained network tend to converge faster and can also get by with smaller datasets. The paper verifies this experimentally:


For example, on the SRL task, a model with ELMo using only 1% of the training set reaches the performance of a model without ELMo using 10% of the training set!

V. Summary
ELMo has the following excellent properties:

Contextual: the representation of each word depends on the entire context in which it is used.
Deep: the word representation combines all layers of a deep pretrained neural network.
Character-based: ELMo representations are built purely from characters, which are then passed through a CharCNN to form word representations; this solves the OOV problem and keeps the input vocabulary small.
Well-resourced: complete source code, pretrained model parameters, and detailed usage instructions and examples are available, a real boon for practitioners. There is even a dedicated multi-language implementation, apparently from HIT; see the Links section for the project.
Links
Paper: https://arxiv.org/pdf/1802.05365.pdf
Project home: https://allennlp.org/elmo
Source: https://github.com/allenai/allennlp (PyTorch; contains the ELMo-related part)
https://github.com/allenai/bilm-tf (TensorFlow)
Multi-language: https://github.com/HIT-SCIR/ELMoForManyLangs (HIT's multilingual ELMo from the CoNLL evaluation, including Traditional Chinese)
---------------------
Author: MagicBubble
Source: CSDN
Original: https://blog.csdn.net/Magical_Bubble/article/details/89160032
Copyright: this is the blogger's original article; please include a link to the original when reposting.
