Character-Level Sequence Labeling with BiLSTM+CRF in Keras

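A BiLSTM alone can handle sequence labeling tasks such as word segmentation or named entity recognition, and a standalone CRF also does well. However, an LSTM on its own scores each position without constraining adjacent tags, so its predictions can contain invalid sequences such as I-Organization -> I-Person or B-Organization -> I-Person. This is the motivation for a hybrid LSTM + CRF model.
A CRF avoids this kind of error because its feature functions observe the input sequence and learn features that describe the relationships between neighboring tokens within a limited window.
Placing a CRF on top of the LSTM output lets the LSTM learn a new model under the constraints of the CRF features, trained with the CRF's loss function.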
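
To make the difference concrete, here is a minimal NumPy sketch (not the keras-contrib implementation) that compares a per-position argmax with Viterbi decoding under a transition matrix. The tag names, emission scores, and transition penalties are made-up values for illustration only.

```python
import numpy as np

# Hypothetical tag set and scores, for illustration only.
tags = ["O", "B-Org", "I-Org", "I-Per"]

# Emission scores (rows: time steps, columns: tags), as a BiLSTM might produce.
emissions = np.array([
    [0.1, 2.0, 0.0, 0.0],   # step 0: B-Org is clearly best
    [0.0, 0.0, 1.0, 1.2],   # step 1: I-Per is slightly ahead of I-Org
])

# transitions[i, j] is the score of moving from tag i to tag j.
transitions = np.zeros((4, 4))
transitions[1, 3] = -10.0   # B-Org -> I-Per is heavily penalized
transitions[2, 3] = -10.0   # I-Org -> I-Per is heavily penalized


def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence."""
    n_steps, n_tags = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros((n_steps, n_tags), dtype=int)
    for t in range(1, n_steps):
        # candidate[i, j]: best score ending in tag i at t-1, then moving to tag j
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)
    best = [int(score.argmax())]
    for t in range(n_steps - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return best[::-1]


greedy_path = [tags[i] for i in emissions.argmax(axis=1)]
crf_path = [tags[i] for i in viterbi(emissions, transitions)]
print(greedy_path)   # per-position argmax ignores transitions
print(crf_path)      # Viterbi takes the transition penalties into account
```

With these toy numbers the greedy path contains the invalid B-Org -> I-Per step, while the Viterbi path prefers B-Org -> I-Org.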

Installing the dependencies:

keras-contrib
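keras-contrib is the official extension library for Keras, the Python deep learning library. It contains additional layers, activations, loss functions, optimizers, and so on that are not yet available in Keras itself. All of these modules can be used together with core Keras models and modules. The CRF layer used here is one of them.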

pip install git+https://www.github.com/keras-team/keras-contrib.git

or

git clone https://www.github.com/keras-team/keras-contrib.git
cd keras-contrib
python setup.py install

Implementation

1. Data preprocessing

The data used here has one label per character, in the following format:
data.txt

四月网络理政平台累计访问量二十四万八千七百三十六人次 …
人民网成都五月五日电昨日成都住房公积金管理中心发布通知 …

label.txt (these are not the real label values; they only illustrate the format)

1 2 3 4 5 6 7 8 9 10 1 2 3 4 5 6 7 8 9 10 …
10 9 8 7 6 5 4 3 2 1 10 9 8 7 6 5 4 3 2 1 …

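Characters and labels are then encoded: collect the full character set, number the characters by frequency, and build the character-to-id and id-to-character indexes. Labels are handled the same way. Next, choose a maximum sentence length based on the distribution of sentence lengths (mean, maximum, minimum, and so on), and pad every sentence to that length.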

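from itertools import chain
import pickle

import numpy as np
import pandas as pd

datafile = open('data.txt', 'r', encoding='utf-8')
labelfile = open('label.txt', 'r', encoding='utf-8')

words, labels = [], []

count = 0
for data, label in zip(datafile, labelfile):
    count += 1
    # each line is split on spaces into one token per character / label
    s1 = data.strip('\n').split(' ')
    s2 = label.strip('\n').split(' ')

    words.append(s1)
    labels.append(s2)
    if count == 10000:  # only the first 10000 sentences are used
        break

datafile.close()
labelfile.close()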

# Get words set
all_words = list(chain(*words))   # words is a list of lists; chain(*words) flattens it into a single list
all_words_sr = pd.Series(all_words)  # a Series is a 1-D array-like object holding values plus an index of labels
all_words_counts = all_words_sr.value_counts()  # character frequencies
all_words_set = all_words_counts.index  # index holds the characters, values hold their counts, in descending order

# Get words ids
all_words_ids = range(1, len(all_words_set) + 1)  # ids start at 1; 0 is used for padding

# Dict to transform
word2id = pd.Series(all_words_ids, index=all_words_set)  # character -> id, ordered by descending frequency
id2word = pd.Series(all_words_set, index=all_words_ids)  # id -> character

# Tag set and ids
tags_set = ['0', '1', '2', '3',...]  # tag '0' is assigned to invalid characters, to handle OOV (out-of-vocabulary) cases
tags_ids = range(len(tags_set))

# Dict to transform
tag2id = pd.Series(tags_ids, index=tags_set)
id2tag = pd.Series(tags_set, index=tags_ids)

max_length = 200  # maximum sentence length

def x_transform(words):
    ids = list(word2id[words])
    if len(ids) >= max_length:
        ids = ids[:max_length]
    ids.extend([0] * (max_length - len(ids)))
    return ids
    
def y_transform(tags):
    ids = list(tag2id[tags])
    if len(ids) >= max_length:
        ids = ids[:max_length]
    ids.extend([0] * (max_length - len(ids)))
    return ids

print('Starting transform...')
# print(words)
data_x = list(map(lambda x: x_transform(x), words))   # one id sequence per sentence; map applies the transform to each sentence in turn
data_y = list(map(lambda y: y_transform(y), labels))  # one tag-id sequence per sentence, as a list of lists

data_x = np.asarray(data_x)
data_y = np.asarray(data_y)

from os import makedirs
from os.path import exists, join

path = 'data/'
if not exists(path):
    makedirs(path)

print('Starting pickle to file...')
with open(join(path, 'data-small.pkl'), 'wb') as f:
    pickle.dump(data_x, f)  # objects are serialized one after another into the same file
    pickle.dump(data_y, f)
    pickle.dump(word2id, f)
    pickle.dump(id2word, f)
    pickle.dump(tag2id, f)
    pickle.dump(id2tag, f)
print('Pickle finished')
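
The maximum sentence length above is fixed at 200. As a rough sketch of how such a value might be chosen from the length distribution, the snippet below computes summary statistics on a toy list of sentences. The `sentences` data and the 95th-percentile rule are assumptions for illustration, not the procedure actually used here.

```python
import numpy as np

# Toy corpus standing in for the `words` list (hypothetical data).
sentences = [
    list("四月网络理政平台"),
    list("人民网成都"),
    list("昨日成都住房公积金管理中心发布通知"),
]

lengths = np.array([len(s) for s in sentences])
print("min:", lengths.min(), "max:", lengths.max(), "mean:", lengths.mean())

# One possible rule: pick a length that covers about 95% of the sentences
# and truncate the few that are longer.
suggested_max_length = int(np.percentile(lengths, 95))
print("suggested max_length:", suggested_max_length)
```

A shorter maximum length reduces padding and training time, at the cost of truncating the longest sentences.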

2. Building the network model

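This uses Keras 2.1.4. Keras itself already includes what is needed for a BiLSTM, while the CRF layer is provided by keras-contrib.
The difference between a bidirectional and a unidirectional LSTM is the use of the Bidirectional wrapper.
The model consists of one embedding layer, one BiLSTM layer, and one CRF layer.
Note that the loss function and the accuracy metric are taken from the CRF layer (crf.loss_function and crf.accuracy).

The code that builds the network is as follows: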

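import pickle

from keras.models import Sequential
from keras.layers import Embedding, Bidirectional, LSTM
from keras_contrib.layers import CRF
from sklearn.model_selection import train_test_split

EMBED_DIM = 200
BiRNN_UNITS = 200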

def load_data():
    source_data = 'data/data-small.pkl'
    with open(source_data, 'rb') as f:
        data_x = pickle.load(f)
        data_y = pickle.load(f)
        word2id = pickle.load(f)
        id2word = pickle.load(f)
        tag2id = pickle.load(f)
        id2tag = pickle.load(f)
        return data_x, data_y, word2id, id2word, tag2id, id2tag

def create_model(train=True):
    if train:
        data_x, data_y, word2id, id2word, tag2id, id2tag = load_data()
        data_y = data_y.reshape((data_y.shape[0], data_y.shape[1], 1))
        train_x, test_x, train_y, test_y = train_test_split(data_x, data_y, test_size=0.2, random_state=40)
        vocab = word2id.keys()
        chunk_tags = tag2id.keys()

    else:
        data_x, data_y, word2id, id2word, tag2id, id2tag = load_data()
        vocab = word2id.keys()
        chunk_tags = tag2id.keys()
    model = Sequential()
    model.add(Embedding(len(vocab)+1, EMBED_DIM, mask_zero=True))  # Random embedding
    model.add(Bidirectional(LSTM(BiRNN_UNITS // 2, return_sequences=True)))
    crf = CRF(len(chunk_tags), sparse_target=True)
    model.add(crf)
    model.summary()
    model.compile('adam', loss=crf.loss_function, metrics=[crf.accuracy])
    if train:
        return model, (train_x, train_y), (test_x, test_y)
    else:
        return model, (vocab, chunk_tags)

3. Training

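Once the data has been processed, the model can be trained. Here the training and validation sets are split manually. The batch size is set to 64 to speed up training, and the model is trained for 10 epochs.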

import bilsm_crf_model

EPOCHS = 10
model, (train_x, train_y), (test_x, test_y) = bilsm_crf_model.create_model()
# train model
model.fit(train_x, train_y, batch_size=64, epochs=EPOCHS, validation_data=(test_x, test_y))
model.save('model/crf.h5')  # the same path is read by load_weights at prediction time

4. Validation

The input is raw characters, so they must first be converted to ids with word2id before being fed to the model to predict labels.

import bilsm_crf_model
import numpy as np
from keras.preprocessing.sequence import pad_sequences

maxlen = 200

model, (vocab, chunk_tags) = bilsm_crf_model.create_model(train=False)
predict_text = '人民网成都五月五日电昨日成都住房公积金管理中心发布通知'
# ids start at 1 so that they match word2id from preprocessing (0 is the padding id)
word2idx = dict((w, i) for i, w in enumerate(vocab, start=1))
print(len(word2idx))
# characters that are not in the vocabulary fall back to id 1
x = [word2idx.get(w[0].lower(), 1) for w in predict_text]
length = min(len(x), maxlen)
# right padding and truncation, matching x_transform in preprocessing
x = pad_sequences([x], maxlen, padding='post', truncating='post')
model.load_weights('model/crf.h5')
raw = model.predict(x)[0][:length]  # keep only the non-padded positions
result = [np.argmax(row) for row in raw]
result_tags = [chunk_tags[i] for i in result]

print(result_tags)
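
The encoding and decoding steps around `model.predict` can be tried out without a trained model. The sketch below uses a tiny hypothetical vocabulary and a hand-written score array in place of real model output, and it assumes right padding with 0, as in the preprocessing step.

```python
import numpy as np

# Hypothetical toy vocabulary and tag list; ids start at 1, 0 is padding.
word2idx = {"成": 1, "都": 2, "人": 3}
chunk_tags = ["0", "1", "2"]
maxlen = 6

text = "人成都"
x = [word2idx.get(ch, 1) for ch in text]
length = min(len(x), maxlen)
padded = x[:maxlen] + [0] * (maxlen - length)   # right padding with 0

# Stand-in for model.predict(...)[0]: one row of tag scores per position.
fake_output = np.zeros((maxlen, len(chunk_tags)))
fake_output[0, 1] = 1.0
fake_output[1, 2] = 1.0
fake_output[2, 2] = 1.0

raw = fake_output[:length]                      # drop the padded positions
result_tags = [chunk_tags[int(np.argmax(row))] for row in raw]
print(padded, result_tags)
```

Slicing with `[:length]` only lines up with the input when the padding is on the right; with left padding the slice would have to be `[-length:]` instead.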

5. Extension

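To build a BiLSTM in Keras without the CRF, construct the model as follows.
Note that for multi-class classification the CRF layer is replaced with a softmax layer.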

model = Sequential()
model.add(Embedding(len(vocab)+1, EMBED_DIM, mask_zero=True))  # Random embedding
model.add(Bidirectional(LSTM(BiRNN_UNITS // 2, return_sequences=True)))
model.add(Dense(22, activation='softmax'))
model.summary()
model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])
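
One detail worth checking with the softmax variant: Keras' `categorical_crossentropy` expects one-hot targets, whereas `sparse_categorical_crossentropy` accepts integer tag ids directly. Below is a minimal NumPy sketch of converting integer labels of shape (samples, max_length) into one-hot form. The array values and the tag count of 4 are illustrative only.

```python
import numpy as np

n_tags = 4  # illustrative; the model above uses 22 output classes
data_y = np.array([[1, 3, 0],
                   [2, 2, 0]])                  # (samples, max_length), integer tag ids

# Index an identity matrix with the ids to get one one-hot row per position.
one_hot_y = np.eye(n_tags, dtype="float32")[data_y]   # (samples, max_length, n_tags)
print(one_hot_y.shape)

# argmax over the last axis recovers the integer ids.
recovered = one_hot_y.argmax(axis=-1)
```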

Problems encountered:

```
[[{{node embedding_1/embedding_lookup}}]] tensorflow.python.framework.errors_impl.InvalidArgumentError: indices[498,120] = 74072 is not in [0, 74072)
```
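Cause:
In this code the vocabulary ids start at 1.
The embedding layer was initially created with model.add(Embedding(len(vocab), EMBED_DIM, mask_zero=True)), which makes the embedding matrix n_words*embed_size. Compared with the ids produced during preprocessing, however, the matrix needs to be (n_words+1)*embed_size, because in addition to all the characters in the vocabulary, 0 is used as the padding id.
So the line should be changed to model.add(Embedding(len(vocab)+1, EMBED_DIM, mask_zero=True)).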
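
The off-by-one can be reproduced with a plain NumPy lookup table standing in for the embedding matrix. This is only a sketch with arbitrary sizes, not the Keras internals.

```python
import numpy as np

n_words, embed_dim = 5, 3
ids = np.array([0, 1, 5])   # 0 is padding, real ids run from 1 to n_words

too_small = np.zeros((n_words, embed_dim))       # valid row indices: 0..4
try:
    too_small[ids]
    lookup_failed = False
except IndexError:
    lookup_failed = True                         # id 5 has no row

big_enough = np.zeros((n_words + 1, embed_dim))  # valid row indices: 0..5
vectors = big_enough[ids]
print(lookup_failed, vectors.shape)
```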

Another problem: the label input has the wrong format.

tensorflow.python.framework.errors_impl.InvalidArgumentError: Index out of range using input dim 2; input has only 2 dims
[[{{node loss/crf_1_loss/strided_slice}}]]

data_y should be reshaped from 2 dimensions to 3:
data_y = data_y.reshape((data_y.shape[0], data_y.shape[1], 1))
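
As a quick shape check, the reshape simply appends a trailing axis of size 1. The sketch below uses a dummy array whose sizes are chosen for illustration.

```python
import numpy as np

data_y = np.zeros((4, 200), dtype="int32")   # (samples, max_length)
data_y_3d = data_y.reshape((data_y.shape[0], data_y.shape[1], 1))
print(data_y_3d.shape)

# np.expand_dims gives the same result in one call.
same = np.array_equal(data_y_3d, np.expand_dims(data_y, -1))
```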

Reference source code:
https://github.com/stephen-v/zh-NER-keras
Reference link:
https://blog.csdn.net/qq_25439417/article/details/83651714

vivian_ll
