PaddlePaddle: incorrect results when predicting sentiment for sentences

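Keywords: data dictionary
Problem description: Three sentences are fed to the model to predict the probability that each one is positive or negative. Most of the predictions are wrong, and the encoded form of every sentence is unexpectedly long.
Erroneous output: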

['read the book forget the movie', 'this is a great movie', 'this is very bad']
[[2237, 4008, 2, 2062, 5146, 3602, 3752, 4008, 5146, 951, 2903, 2903, 5146, 5146, 2414, 2903, 2237, 3316, 4008, 3602, 5146, 3602, 3752, 4008, 5146, 4136, 2903, 5146, 8, 4008], [3602, 3752, 8, 2551, 5146, 8, 2551, 5146, 2, 5146, 3316, 2237, 4008, 2, 3602, 5146, 4136, 2903, 5146, 8, 4008], [3602, 3752, 8, 2551, 5146, 8, 2551, 5146, 5146, 4008, 2237, 5146, 5146, 951, 2, 2062]]
Predict probability of  0.59143597  to be positive and  0.4085641  to be negative for review ' read the book forget the movie '
Predict probability of  0.73750913  to be positive and  0.26249087  to be negative for review ' this is a great movie '
Predict probability of  0.55495805  to be positive and  0.445042  to be negative for review ' this is very bad '

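Reproducing the problem: Before prediction, each sentence must be converted into a list of words, and each word must then be converted into its id. Here the sentences are converted with reviews = [c for c in reviews_str], and the result is encoded through the dataset dictionary and used for prediction. Almost all of the predictions are wrong. The faulty code is as follows: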

inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
reviews_str = ['read the book forget the movie', 'this is a great movie', 'this is very bad']
# BUG: this only copies the list; each element is still a whole sentence string
reviews = [c for c in reviews_str]
print(reviews)
UNK = word_dict['<unk>']
lod = []
for c in reviews:
    lod.append([word_dict.get(words.encode('utf-8'), UNK) for words in c])
print(lod)
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
results = inferencer.infer({'words': tensor_words})

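Solution: The sentences are not split into words during preprocessing. reviews = [c for c in reviews_str] leaves each element as a whole string, so the inner loop iterates over its characters, and the dictionary lookup is performed on single characters (including spaces) instead of words. This is why the first sentence, which has 30 characters, is encoded as 30 ids. The sentences should be split with reviews = [c.split() for c in reviews_str]. The correct code is as follows: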

inferencer = Inferencer(
    infer_func=partial(inference_program, word_dict),
    param_path=params_dirname,
    place=place)
reviews_str = ['read the book forget the movie', 'this is a great movie', 'this is very bad']
# Split each sentence on whitespace so that each element is a list of words
reviews = [c.split() for c in reviews_str]
print(reviews)
UNK = word_dict['<unk>']
lod = []
for c in reviews:
    lod.append([word_dict.get(words.encode('utf-8'), UNK) for words in c])
print(lod)
base_shape = [[len(c) for c in lod]]
tensor_words = fluid.create_lod_tensor(lod, base_shape, place)
results = inferencer.infer({'words': tensor_words})

Correct output:

[['read', 'the', 'book', 'forget', 'the', 'movie'], ['this', 'is', 'a', 'great', 'movie'], ['this', 'is', 'very', 'bad']]
[[325, 0, 276, 818, 0, 16], [9, 5, 2, 78, 16], [9, 5, 51, 81]]
Predict probability of  0.44390476  to be positive and  0.55609524  to be negative for review ' read the book forget the movie '
Predict probability of  0.83933955  to be positive and  0.16066049  to be negative for review ' this is a great movie '
Predict probability of  0.35688713  to be positive and  0.64311296  to be negative for review ' this is very bad '
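
The difference between the two encodings can be reproduced without PaddlePaddle. The following is a minimal sketch, not the code from the original program: toy_word_dict, encode and check_tokenized are hypothetical names introduced for illustration, and the toy dictionary uses plain str keys, whereas the real word_dict is loaded from the dataset. It shows that iterating over a string yields characters, so the encoded length equals the character count, and it adds a small guard that rejects sentences that have not been split.

```python
# Toy vocabulary for illustration only; the real word_dict comes from the dataset.
toy_word_dict = {"this": 0, "is": 1, "a": 2, "great": 3, "movie": 4, "<unk>": 5}
UNK = toy_word_dict["<unk>"]


def encode(tokens, vocab, unk):
    """Map each token to its id, falling back to the unknown-word id."""
    return [vocab.get(tok, unk) for tok in tokens]


def check_tokenized(reviews):
    """Hypothetical guard: fail early if a review is still a raw string."""
    for review in reviews:
        if isinstance(review, str):
            raise TypeError(
                "expected a list of words, got a raw string; call split() first")


sentence = "this is a great movie"

# Iterating over a str yields single characters, so every letter and space
# is looked up in the dictionary on its own.
char_ids = encode(sentence, toy_word_dict, UNK)

# split() yields whitespace-separated words, which is what the dictionary is keyed on.
word_ids = encode(sentence.split(), toy_word_dict, UNK)

print(len(sentence), len(char_ids))  # one id per character
print(len(sentence.split()), word_ids)  # one id per word

check_tokenized([sentence.split()])  # passes silently
```

Calling check_tokenized([sentence]) with the unsplit string raises TypeError, which would have exposed the mistake before any prediction was run. Comparing the length of each encoded review with the number of words in the sentence is another quick check.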
