Using fastText in Python 3 for news text classification

Introduction

This post briefly records how to use the Python version of fastText to classify news articles into categories, with jieba for word segmentation and pandas for data processing along the way. For news data, you can use Tsinghua's news dataset.

Install dependencies

Python version: 3.6
Install jieba (Chinese word segmentation) and fasttext:

pip install jieba
pip install fasttext

Word segmentation

Commonly used stop words are removed during word segmentation. A stop word list is available at https://github.com/dongxiexidian/Chinese/tree/master/dict

segmentation.py

import jieba
import pandas as pd
import codecs
import math
import random

stopwords_set = set()
basedir = '/Users/derry/Desktop/Data/'

# Output files for the segmented text
train_file = codecs.open(basedir + "news.data.seg.train", 'w', 'utf-8')
test_file = codecs.open(basedir + "news.data.seg.test", 'w', 'utf-8')

# Stop word file
with open(basedir + 'stop_text.txt', 'r', encoding='utf-8') as infile:
    for line in infile:
        stopwords_set.add(line.strip())

train_data = pd.read_table(basedir + 'News_info_train.txt', header=None, error_bad_lines=False)
label_data = pd.read_table(basedir + 'News_pic_label_train.txt', header=None, error_bad_lines=False)

train_data.drop([2], axis=1, inplace=True)
train_data.columns = ['id', 'text']
label_data.drop([2, 3], axis=1, inplace=True)
label_data.columns = ['id', 'class']
train_data = pd.merge(train_data, label_data, on='id', how='outer')

for index, row in train_data.iterrows():
    # Segment the text with jieba
    seg_text = jieba.cut(row['text'].replace("\t", " ").replace("\n", " "))
    outline = " ".join(seg_text)
    outline = " ".join(outline.split())

    # Remove stop words and HTML tags (only tokens listed in stopwords_set are dropped)
    outline_list = outline.split(" ")
    outline_list_filter = [item for item in outline_list if item not in stopwords_set]
    outline = " ".join(outline_list_filter)

    # Write the line, skipping rows that have no label
    if not math.isnan(row['class']):
        outline = outline + "\t__label__" + str(int(row['class'])) + "\n"
        train_file.write(outline)
        train_file.flush()

        # Optional train/test split (commented out, so the test file stays empty)
        # if random.random() > 0.7:
        #     test_file.write(outline)
        #     test_file.flush()
        # else:
        #     train_file.write(outline)
        #     train_file.flush()

train_file.close()
test_file.close()
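
Because the split in segmentation.py is commented out, news.data.seg.test is created but never written to. A minimal sketch of one way to do the split is shown below, assuming the same 70/30 ratio as the commented-out code. The function name split_lines and the fixed seed are illustrative assumptions, not part of the original script.

```python
import random


def split_lines(lines, test_ratio=0.3, seed=42):
    """Randomly split labelled lines into (train, test) lists.

    test_ratio and seed are illustrative choices; the fixed seed only
    makes the split reproducible between runs.
    """
    rng = random.Random(seed)
    train, test = [], []
    for line in lines:
        # Each line goes to the test set with probability test_ratio.
        if rng.random() < test_ratio:
            test.append(line)
        else:
            train.append(line)
    return train, test


# Toy lines in the same "text<TAB>__label__N" format that segmentation.py writes.
samples = ["word%d other\t__label__%d\n" % (i, i % 2) for i in range(10)]
train_lines, test_lines = split_lines(samples)
```

The two returned lists could then be written to train_file and test_file in place of the single train_file.write(outline) call.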

Classification prediction

When training with fasttext here, I changed the word_ngrams parameter from its default of 1 to 3, which may give somewhat better results. However, bucket=2000000 (the default value) then has to be passed explicitly as well, otherwise training fails with an error. Judging from the project's issues, the fastText code behind the Python package seems to be relatively old; the official C++ version does not have this problem.
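
To see why word_ngrams and bucket go together: fastText does not keep a separate entry for every word n-gram but hashes n-grams into a fixed number of buckets. The sketch below only illustrates that idea. The helpers word_ngrams and bucket_index are hypothetical names introduced here, and zlib.crc32 is a stand-in for fastText's own hash function, which is different.

```python
import zlib


def word_ngrams(tokens, max_n):
    """Return all contiguous word n-grams with 2 <= n <= max_n."""
    grams = []
    for n in range(2, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


def bucket_index(ngram, bucket):
    """Map an n-gram to one of `bucket` slots (crc32 is only a stand-in hash)."""
    return zlib.crc32(ngram.encode("utf-8")) % bucket


tokens = ["stock", "market", "falls", "today"]
# With max_n=3 this yields 3 bigrams followed by 2 trigrams.
grams = word_ngrams(tokens, max_n=3)
indices = [bucket_index(g, bucket=2000000) for g in grams]
```

In this picture, a bucket count of 0 would leave the modulo with nowhere to map n-grams to, which may be related to the error mentioned above. That link is a guess, not something the original post confirms.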

classification.py

import logging
import fasttext
import pandas as pd
import codecs

basedir = '/Users/derry/Desktop/Data/'
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Train the classifier
classifier = fasttext.supervised(basedir + "news.data.seg.train", basedir + "news.dat.seg.model", label_prefix="__label__", word_ngrams=3, bucket=2000000)

# Evaluate on the test set and print the F-score
result = classifier.test(basedir + "news.data.seg.test")
print(result.precision * result.recall * 2 / (result.recall + result.precision))

# Read the (already segmented) validation set, one text per line
validate_texts = []
with open(basedir + 'news.data.seg.validate', 'r', encoding='utf-8') as infile:
    for line in infile:
        validate_texts.append(line)

# Predict a label for each validation text
labels = classifier.predict(validate_texts)

# Result file
result_file = codecs.open(basedir + "result.txt", 'w', 'utf-8')

validate_data = pd.read_table(basedir + 'News_info_validate.txt', header=None, error_bad_lines=False)
validate_data.drop([2], axis=1, inplace=True)
validate_data.columns = ['id', 'text']

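# Write one result line per validation row (str() keeps this working if the ids are parsed as integers)
for index, row in validate_data.iterrows():
    outline = str(row['id']) + '\t' + labels[index][0] + '\tNULL\tNULL\n'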
    result_file.write(outline)
    result_file.flush()

result_file.close()
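
The F-score expression in classification.py divides by precision + recall, which raises ZeroDivisionError when both are zero. A small guarded helper is sketched below; the name f1_score is introduced here for illustration and is not part of fasttext.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Example values chosen for illustration only.
print(f1_score(0.8, 0.6))
print(f1_score(0.0, 0.0))
```

With the classifier above it would be called as f1_score(result.precision, result.recall).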

Finally, the code is available on GitHub: https://github.com/DerryChan/CSIT6000/tree/master/Derry

