Introduction
This post briefly records how to use the Python version of fastText to classify news into different categories, with jieba (结巴) word segmentation and pandas for data processing along the way. For news data you can use Tsinghua's news dataset.
Install dependencies
Python version: 3.6
Install jieba (for word segmentation) and fastText:
pip install jieba
pip install fasttext
Word segmentation
Commonly used stop words are removed during segmentation. A stop-word list is available at https://github.com/dongxiexidian/Chinese/tree/master/dict
segmentation.py
import jieba
import pandas as pd
import codecs
import math
import random

stopwords_set = set()
basedir = '/Users/derry/Desktop/Data/'

# Output files for the segmentation results
train_file = codecs.open(basedir + "news.data.seg.train", 'w', 'utf-8')
test_file = codecs.open(basedir + "news.data.seg.test", 'w', 'utf-8')

# Load the stop-word file
with open(basedir + 'stop_text.txt', 'r', encoding='utf-8') as infile:
    for line in infile:
        stopwords_set.add(line.strip())

train_data = pd.read_table(basedir + 'News_info_train.txt', header=None, error_bad_lines=False)
label_data = pd.read_table(basedir + 'News_pic_label_train.txt', header=None, error_bad_lines=False)
train_data.drop([2], axis=1, inplace=True)
train_data.columns = ['id', 'text']
label_data.drop([2, 3], axis=1, inplace=True)
label_data.columns = ['id', 'class']
train_data = pd.merge(train_data, label_data, on='id', how='outer')

for index, row in train_data.iterrows():
    # Segment with jieba
    seg_text = jieba.cut(row['text'].replace("\t", " ").replace("\n", " "))
    outline = " ".join(seg_text)
    outline = " ".join(outline.split())

    # Remove stop words and HTML tags
    outline_list = outline.split(" ")
    outline_list_filter = [item for item in outline_list if item not in stopwords_set]
    outline = " ".join(outline_list_filter)

    # Write out (only rows that actually have a label)
    if not math.isnan(row['class']):
        outline = outline + "\t__label__" + str(int(row['class'])) + "\n"
        train_file.write(outline)
        train_file.flush()
        # Train/test split (alternative to writing everything to the training file)
        # if random.random() > 0.7:
        #     test_file.write(outline)
        #     test_file.flush()
        # else:
        #     train_file.write(outline)
        #     train_file.flush()

train_file.close()
test_file.close()
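Each line the loop writes to news.data.seg.train is the space-separated tokens, a tab, and the fastText label. A minimal sketch of the filter-and-format step, using hypothetical tokens in place of jieba's output and a made-up two-word stop list (class 2 is also invented for illustration):

```python
# Hypothetical tokens standing in for the output of jieba.cut()
tokens = ["今天", "的", "新闻", "内容", "很", "精彩"]
stopwords_set = {"的", "很"}  # tiny stand-in stop-word list

# Drop stop words, then append the fastText label the same way the loop does
filtered = [t for t in tokens if t not in stopwords_set]
outline = " ".join(filtered) + "\t__label__" + str(int(2.0)) + "\n"
print(outline)  # 今天 新闻 内容 精彩	__label__2
```

The __label__ prefix is what fastText's supervised mode looks for by default, which is also why label_prefix="__label__" is passed during training below.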
Classification prediction
When training with fastText here, I tuned the word_ngrams parameter. Its default value is 1; raising it may improve results, but you then also need to pass bucket=2000000 (the default value), otherwise an error is thrown. From the project's issues, it seems the Python binding of fastText is relatively old; the official C++ version does not have this problem.
classification.py
import logging
import fasttext
import pandas as pd
import codecs

basedir = '/Users/derry/Desktop/Data/'
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# Train
classifier = fasttext.supervised(basedir + "news.data.seg.train", basedir + "news.dat.seg.model", label_prefix="__label__", word_ngrams=3, bucket=2000000)

# Test and print the F-score
result = classifier.test(basedir + "news.data.seg.test")
print(result.precision * result.recall * 2 / (result.recall + result.precision))

# Read the validation set
validate_texts = []
with open(basedir + 'news.data.seg.validate', 'r', encoding='utf-8') as infile:
    for line in infile:
        validate_texts += [line]

# Predict labels
labels = classifier.predict(validate_texts)

# Result file
result_file = codecs.open(basedir + "result.txt", 'w', 'utf-8')
validate_data = pd.read_table(basedir + 'News_info_validate.txt', header=None, error_bad_lines=False)
validate_data.drop([2], axis=1, inplace=True)
validate_data.columns = ['id', 'text']

# Write out
for index, row in validate_data.iterrows():
    outline = row['id'] + '\t' + labels[index][0] + '\tNULL\tNULL\n'
    result_file.write(outline)
    result_file.flush()

result_file.close()
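The value printed after testing is just the balanced F-score (F1), the harmonic mean of the precision and recall that classifier.test() returns. As a standalone sanity check of the formula (the numbers here are made up):

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall -- the quantity printed above
    return 2 * precision * recall / (precision + recall)

print(f1_score(0.8, 0.6))  # ≈ 0.6857
```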
Finally, the full code is on GitHub: https://github.com/DerryChan/CSIT6000/tree/master/Derry