MatchZoo: an easy-to-use open-source deep text-matching project from the Chinese Academy of Sciences

MatchZoo is an open-source text-matching toolkit developed in Python on top of TensorFlow. It makes it more intuitive to understand deep text-matching models, more convenient to compare the performance of different models, and faster to develop new deep matching models. As far as I understand it, the main idea of MatchZoo is to implement deep matching models as siamese (twin) networks.

Text matching covers text similarity, textual entailment, question-answer matching, and more. Here I'll give a simple walkthrough of similarity computation using Microsoft's public MSR dataset. My code follows the official MatchZoo documentation; if anything is wrong, please don't hesitate to say so, and I will keep improving it.

Dataset description: MSR is a standard English short-text similarity dataset. The training set contains 4076 sentence pairs, of which 2753 are positive (similar) pairs; the test set contains 1725 sentence pairs, of which 1147 are positive.
Download link: https://www.microsoft.com/en-us/download/details.aspx?id=52398&from=http%3A%2F%2Fresearch.microsoft.com%2Fen-us%2Fdownloads%2F607d14d9-20cd-47e3-85bc-a2f65cd28042%2Fdefault.aspx
If that's too much trouble, I've shared my downloaded copy here: https://download.csdn.net/download/weixin_40902563/12047069
Points aren't easy to earn, so if you don't have any, you can send me a private message and I'll send you the file free of charge (provided I actually see the message; I check roughly once every two days).

The official dataset is a TSV file with five columns: Quality (the 0/1 label), #1 ID, #2 ID, #1 String, and #2 String. (The original post showed a screenshot of the file here.)
Reading it directly with pandas goes slightly wrong, though, so I used readlines instead and split each line on \t.
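For reference, the first line of the file is the header, and every following line is one tab-separated record. The row below is illustrative only (placeholder IDs and sentences, not copied from the actual file):

Quality	#1 ID	#2 ID	#1 String	#2 String
1	1000001	1000002	<first sentence of the pair>	<second sentence of the pair>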
The code begins next. This isn't my usual coding style, but to make every step clear, I'll explain it one section at a time.
Import the modules and define the data preprocessor:

#!/usr/bin/python3
# -*- coding:utf-8 -*-
# Author:ChenYuan

import matchzoo as mz
import pandas as pd
import re  # optional, not actually used below
from sklearn import preprocessing  # used later to rescale the prediction scores
import numpy as np
preprocessor = mz.preprocessors.BasicPreprocessor()  # define a data preprocessor; MatchZoo ships four of them, and Basic is the general-purpose one. See the official docs, I won't explain it here

Data format conversion:

data = []
data_type = 'train'
train_data_path = 'msr_%s.tsv' % data_type  # output path for the converted file; undefined in the original post, so pick any path you like
with open('msr_paraphrase_%s.txt' % data_type, 'r', encoding='utf-8') as f:
    for line in f.readlines()[1:]:  # skip the header line
        line = line.strip().split('\t')
        data.append([line[1], line[3], line[2], line[4], line[0]])  # reorder the columns into MatchZoo's input format
data = pd.DataFrame(data)
data.to_csv(train_data_path, header=False, index=False, sep='\t')
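The test split needs the same conversion. A minimal sketch, wrapping the loop above in a function (the msr_train.tsv / msr_test.tsv output paths are my own choice, not anything official):

def convert(data_type):
    # read the raw MSR file and rewrite it in MatchZoo's column order
    rows = []
    with open('msr_paraphrase_%s.txt' % data_type, 'r', encoding='utf-8') as f:
        for line in f.readlines()[1:]:  # skip the header line
            line = line.strip().split('\t')
            rows.append([line[1], line[3], line[2], line[4], line[0]])
    out_path = 'msr_%s.tsv' % data_type
    pd.DataFrame(rows).to_csv(out_path, header=False, index=False, sep='\t')
    return out_path

train_data_path = convert('train')
test_data_path = convert('test')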

Read data:

def load_data(data_path):
    df_data = pd.read_csv(data_path, sep='\t', header=None)
    df_data = pd.DataFrame(df_data.values, columns=['id_left', 'text_left', 'id_right', 'text_right', 'label'])
    df_data = mz.pack(df_data)  # wrap the DataFrame in a MatchZoo DataPack
    return df_data

train_data = load_data(train_data_path)  # the converted train and test file paths from the previous step
test_data = load_data(test_data_path)
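A quick sanity check, if you want one. mz.pack returns a DataPack, which supports len() and slicing (the split code below relies on this), and, as far as I recall the MatchZoo 2.x API, frame() turns it back into a DataFrame:

print(len(train_data), len(test_data))  # number of sentence pairs in each DataPack
print(train_data.frame().head())  # peek at the packed columns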

Data processing:

train_dev_split = int(len(train_data) * 0.9)  # the dev set takes 10% of the training data
train = train_data[:train_dev_split]
dev = train_data[train_dev_split:]
train_pack_processed = preprocessor.fit_transform(train)  # essentially a token-to-id conversion, so for Chinese text no separate word segmentation is needed
dev_pack_processed = preprocessor.transform(dev)
test_pack_processed = preprocessor.transform(test_data)
train_data_generator = mz.DataGenerator(train_pack_processed, batch_size=32, shuffle=True)  # training data generator

test_x, test_y = test_pack_processed.unpack()
dev_x, dev_y = dev_pack_processed.unpack()
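If you are curious what unpack returns: x is a dict of numpy arrays keyed by field name (the id_left lookup further down relies on this), and y is the label array. A small inspection sketch; the exact key set is what BasicPreprocessor produces in MatchZoo 2.x, as far as I know:

print(dev_x.keys())  # e.g. id_left, text_left, length_left, id_right, text_right, length_right
print(dev_x['text_left'].shape)  # (n_pairs, fixed_text_length)
print(dev_y.shape)  # one label per pair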

Define the model:

def build():
    model = mz.models.DUET()  # as before, see the DUET paper linked from the official site; I won't explain the network or its parameters here, the official docs cover them
    ranking_task = mz.tasks.Ranking(loss=mz.losses.RankCrossEntropyLoss(num_neg=1))  # define the loss function; here the rank cross-entropy loss is used. There is also a categorical cross-entropy loss, depending on how you frame your data
    model.params['input_shapes'] = preprocessor.context['input_shapes']
    model.params['embedding_input_dim'] = preprocessor.context['vocab_size']  # on older versions you must add 1 here to account for the UNK token; newer versions have fixed this
    model.params['embedding_output_dim'] = 300 
    model.params['task'] = ranking_task
    model.params['optimizer'] = 'adam'
    model.params['padding'] = 'same'
    model.params['lm_filters'] = 32
    model.params['lm_hidden_sizes'] = [32]
    model.params['dm_filters'] = 32
    model.params['dm_kernel_size'] = 3
    model.params['dm_d_mpool'] = 3
    model.params['dm_hidden_sizes'] = [32]
    model.params['activation_func'] = 'relu'
    model.params['dropout_rate'] = 0.32
    model.params['embedding_trainable'] = True
    model.guess_and_fill_missing_params(verbose=0)
    model.params.completed()
    model.build()
    model.backend.summary()
    model.compile()
    return model

Training:

model = build()
batch_size = 32

evaluate = mz.callbacks.EvaluateAllMetrics(model, x=dev_x, y=dev_y, batch_size=batch_size)
model.fit_generator(train_data_generator, epochs=5, callbacks=[evaluate], workers=5, use_multiprocessing=False)
y_pred = model.predict(test_x)

left_id = test_x['id_left']
right_id = test_x['id_right']
assert (len(left_id) == len(right_id))
assert (len(left_id) == len(y_pred))
assert (len(test_y) == len(y_pred))
Scale = preprocessing.MinMaxScaler(feature_range=(0, 1))  # rescale the prediction scores into [0, 1]
y_pred = Scale.fit_transform(y_pred)

And that's the end of the code. The final result is y_pred, rescaled to scores between 0 and 1 that represent the similarity of each left/right sentence pair. MatchZoo sometimes produces scores greater than 1 or less than 0, which is theoretically impossible for a similarity (the maximum should be 1 and the minimum 0), so I apply min-max scaling before thresholding (you could also threshold the raw scores directly without rescaling).
For the MSR data, the evaluation metrics are macro-F1 and accuracy. You need to threshold the scores into 0/1 predictions and compare them with the test-set labels. I won't give that code in full here; the rest of the evaluation is just array manipulation, and the hardest parts, reading the data and training the model, have already been given. Still, a minimal sketch follows.
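A minimal sketch of that evaluation, assuming a 0.5 threshold on the rescaled scores (the threshold value is my own choice, not something official; tune it on the dev set):

from sklearn.metrics import accuracy_score, f1_score

y_true = np.asarray(test_y).ravel().astype(int)  # gold 0/1 labels from the test set
y_binary = (y_pred.ravel() >= 0.5).astype(int)  # threshold the [0, 1] scores into hard predictions
print('Accuracy:', accuracy_score(y_true, y_binary))
print('Macro-F1:', f1_score(y_true, y_binary, average='macro'))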

Conclusion:

  1. MatchZoo's embedding layer is a Keras Embedding layer trained from randomly initialized word ids, but the official docs also provide a way to import your own pre-trained word vectors (see the sketch after this list). I tried it several times with pre-trained Chinese vectors from Wikipedia; it was both slower and gave worse results, so I didn't write that code here.
  2. Why DUET? On my machine this model is comparatively efficient and gives the best results, even with nothing but the official default parameters.
  3. This isn't a dataset I use in my own experiments; I went looking for it specifically to write this post. In other words, the code above has actually been run end to end.
  4. As for the DSSM and CDSSM models and their data preprocessors, my machine can't run them, so I didn't experiment with those two, although I have reproduced DSSM and CDSSM myself.
  5. In my view, the biggest reward of studying MatchZoo is not playing with its code, but understanding how these siamese networks are put together and then re-implementing them yourself.
  6. Thank you for reading this far, and thanks for your support.
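Following up on point 1, here is a minimal sketch of loading pre-trained vectors, based on the MatchZoo 2.x tutorial API; the file path and the word2vec text format are assumptions, so adjust them to your own vectors. Call this after model.build() and before training:

embedding = mz.embedding.load_from_file('my_pretrained_vectors.txt', mode='word2vec')  # assumed path; mode can also be 'glove'
term_index = preprocessor.context['vocab_unit'].state['term_index']  # vocabulary built by fit_transform
embedding_matrix = embedding.build_matrix(term_index)  # rows aligned with the preprocessor's word ids
model.load_embedding_matrix(embedding_matrix)  # overwrite the randomly initialized embedding weights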


Source: blog.csdn.net/weixin_40902563/article/details/103663689