Basic introduction and use of pkuseg-python

1. Basic concepts and highlights of pkuseg

1. What is pkuseg

pkuseg is a new Chinese word segmentation toolkit developed by the Language Computing and Machine Learning Research Group at Peking University.
The pkuseg GitHub repository is: https://github.com/lancopku/pkuseg-python

2. Main highlights
  • Multi-domain word segmentation. Unlike previous general-purpose Chinese word segmentation tools, this toolkit aims to provide pre-trained models tailored to data from different domains. Users can freely choose a model according to the domain of the text to be segmented. Pre-trained segmentation models are currently available for the news, web, medicine, and tourism domains, as well as a mixed domain. If the user knows the domain of the text to be segmented, the corresponding model can be loaded; if the domain cannot be determined, the general model trained on mixed-domain data is recommended. For segmentation examples in each domain, see example.txt.
  • Higher segmentation accuracy. Compared with other segmentation toolkits, pkuseg achieves higher accuracy when trained and tested on the same data.
  • Support for user-trained models. Users can train a model on brand-new annotated data.
  • Support for part-of-speech tagging.

2. Compilation and installation

  • Currently only Python 3 is supported.
  • For the best accuracy and speed, it is strongly recommended to update to the latest version via pip.
1. Install via PyPI (the model files are included):
pip3 install pkuseg

Then import it with import pkuseg. Updating to the latest version is recommended for a better out-of-the-box experience:

pip3 install -U pkuseg
2. If the download speed from the official PyPI source is not ideal, it is recommended to use a mirror source, for example:

Initial installation:

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple pkuseg

Update:

pip3 install -i https://pypi.tuna.tsinghua.edu.cn/simple -U pkuseg
3. If you do not install via pip, you can download the source from GitHub and run the following command to install:
python setup.py build_ext -i

The GitHub code does not include the pre-trained models, so users need to download or train a model themselves. Pre-trained models can be found on the releases page. When using one, set model_name to the path of the model file.
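
For example, assuming a pre-trained model has been downloaded from the releases page and extracted to a local directory (the path './news_model' below is hypothetical), it can be loaded like this:
import pkuseg

seg = pkuseg.pkuseg(model_name='./news_model')  # load a model from a local path
print(seg.cut('我爱北京天安门'))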

Note: installation methods 1 and 2 currently only support 64-bit Python 3 on Linux (Ubuntu), macOS, and Windows. On other systems, please use installation method 3 to compile and install locally.

3. How to use

1. Segment with the default configuration (if the user cannot determine the domain of the text, the default model is recommended)
import pkuseg

seg = pkuseg.pkuseg()           # load the model with the default configuration
text = seg.cut('我爱北京天安门')  # perform word segmentation
print(text)
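With the default mixed-domain model, this example is expected to print a list of words, roughly:
['我', '爱', '北京', '天安门']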
2. Domain-specific segmentation (if the user knows the domain of the text, the corresponding domain-specific model is recommended)
import pkuseg

seg = pkuseg.pkuseg(model_name='medicine')  # the corresponding domain-specific model is downloaded automatically
text = seg.cut('我爱北京天安门')              # perform word segmentation
print(text)
3. Perform part-of-speech tagging at the same time. For the detailed meaning of each part-of-speech tag, please refer to tags.txt.
import pkuseg

seg = pkuseg.pkuseg(postag=True)  # enable part-of-speech tagging
text = seg.cut('我爱北京天安门')    # perform segmentation and POS tagging
print(text)
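With POS tagging enabled, each element of the result is expected to be a (word, tag) pair rather than a plain string, roughly:
[('我', 'r'), ('爱', 'v'), ('北京', 'ns'), ('天安门', 'ns')]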
4. Segment a text file
import pkuseg

# segment the file input.txt and write the result to output.txt
# use 20 processes
pkuseg.test('input.txt', 'output.txt', nthread=20)
5. Use a user-defined dictionary in addition to the default one
import pkuseg

seg = pkuseg.pkuseg(user_dict='my_dict.txt')  # use the user dictionary "my_dict.txt" in the current directory
text = seg.cut('我爱北京天安门')                # perform word segmentation
print(text)
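For reference, a user dictionary is a plain-text file with one word per line; if part-of-speech tagging is enabled and a word's tag is known, the tag may follow the word separated by a tab (see the parameter description below). A hypothetical my_dict.txt:
北京大学
天安门	ns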
6. Segment with a self-trained model (taking a CTB8 model as an example)
import pkuseg

seg = pkuseg.pkuseg(model_name='./ctb8')  # assuming the user has downloaded the CTB8 model into './ctb8', load it by setting model_name
text = seg.cut('我爱北京天安门')            # perform word segmentation
print(text)
7. Train a new model (the model is randomly initialized)
import pkuseg

# training file: 'msr_training.utf8'
# test file: 'msr_test_gold.utf8'
# the trained model is saved in the './models' directory
# in training mode, the model from the last iteration is saved as the final model
# only UTF-8 is currently supported; in the training and test sets, all words must be separated by one or more spaces
pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models')
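As noted in the comments, both the training and test files are plain UTF-8 text in which words are separated by one or more spaces; a hypothetical line of data in this format looks like:
我 爱 北京 天安门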
8. Fine-tune training (continue training from a pre-loaded model)
import pkuseg

# training file: 'train.txt'
# test file: 'test.txt'
# load the model in './pretrained', save the trained model to './models', and train for 10 iterations
pkuseg.train('train.txt', 'test.txt', './models', train_iter=10, init_model='./pretrained')
9. Parameter description

1) Model configuration

pkuseg.pkuseg(model_name = "default", user_dict = "default", postag = False)
	model_name		Model path.
					"default": the default; use our pre-trained mixed-domain model (only for users who installed via pip).
					"news": use the news-domain model.
					"web": use the web-domain model.
					"medicine": use the medicine-domain model.
					"tourism": use the tourism-domain model.
					model_path: load a model from a user-specified path.
	user_dict		User dictionary.
					"default": the default; use the dictionary we provide.
					None: do not use any dictionary.
					dict_path: use a user-defined dictionary in addition to the default one; pass the path to your own dictionary. The format is one word per line (if part-of-speech tagging is enabled and the word's tag is known, write the word and the tag on the same line, separated by a tab character).
	postag			Whether to perform part-of-speech tagging.
					False: the default; segment only, without POS tagging.
					True: perform POS tagging along with segmentation.

2) Segmenting a file

pkuseg.test(readFile, outputFile, model_name = "default", user_dict = "default", postag = False, nthread = 10)
	readFile		Input file path.
	outputFile		Output file path.
	model_name		Model path; same as in pkuseg.pkuseg.
	user_dict		User dictionary; same as in pkuseg.pkuseg.
	postag			Whether to enable part-of-speech tagging; same as in pkuseg.pkuseg.
	nthread			Number of processes used when segmenting.

3) Model training

pkuseg.train(trainFile, testFile, savedir, train_iter = 20, init_model = None)
	trainFile		Training file path.
	testFile		Test file path.
	savedir			Path where the trained model is saved.
	train_iter		Number of training iterations.
	init_model		Initial model. The default None means default initialization; users can pass the path of a model to initialize from, e.g. init_model='./models/'.
10. Multi-process word segmentation

When the code examples above are placed in a script file, any code that involves the multi-process functionality must be protected with an if __name__ == '__main__': guard, as shown in the sketch below; see the multi-process word segmentation documentation for details.
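
A minimal sketch of such a guard, reusing the file-segmentation example from step 4:
import pkuseg

if __name__ == '__main__':
    # the guard keeps worker processes from re-running this
    # module-level call when they import the script
    pkuseg.test('input.txt', 'output.txt', nthread=20)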

4. Pre-trained models

When users who installed via pip use the domain-specific segmentation functionality, they only need to set the model_name parameter to the corresponding domain, and the matching domain-specific model is downloaded automatically.

Users who download from GitHub need to download the corresponding pre-trained model themselves and set the model_name parameter to the path of the pre-trained model. Pre-trained models can be downloaded from the releases section. The pre-trained models are described below:

  • news: a model trained on MSRA (news corpus).

  • web: a model trained on Weibo (web text corpus).

  • medicine: a model trained on medical-domain data.

  • tourism: a model trained on tourism-domain data.

  • mixed: a general model trained on a mixed data set. This model is bundled with the pip package.
