Improving jieba word segmentation accuracy: paddle mode (segmentation with Baidu's PaddlePaddle deep learning model)

1 Introduction to Paddle mode

The paddle mode in jieba uses the PaddlePaddle deep learning framework for word segmentation. Compared with jieba's traditional dictionary- and HMM-based algorithm, paddle mode relies on a trained deep learning model, which can deliver higher segmentation accuracy.

Paddle mode is implemented with a convolutional neural network (CNN). During training, the model was trained with supervision on the word segmentation task using the Chinese Wikipedia corpus and an automatically tagged corpus. At inference time, the text is converted into feature vectors by convolution operations, then passed through fully connected and softmax layers to produce a probability distribution over labels for each character; the word boundaries are then determined from this distribution.
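The final decoding step can be illustrated with a small sketch. Assuming (hypothetically) that the per-character softmax output has already been reduced via argmax to one of the standard B/M/E/S sequence-labeling tags (Begin, Middle, End, Single), turning those tags into words looks like this:

```python
def decode_bmes(chars, labels):
    """Convert per-character B/M/E/S labels into a word list.

    B = begins a word, M = middle of a word, E = ends a word,
    S = a single-character word. `labels` stands in for the argmax of
    the model's per-character softmax distribution.
    """
    words, current = [], ""
    for ch, tag in zip(chars, labels):
        if tag == "S":
            if current:            # flush any unfinished word first
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:
                words.append(current)
            current = ch
        else:                      # "M" or "E": extend the current word
            current += ch
            if tag == "E":
                words.append(current)
                current = ""
    if current:
        words.append(current)
    return words

print(decode_bmes("我爱北京", ["S", "S", "B", "E"]))  # ['我', '爱', '北京']
```

This is only a sketch of the decoding idea; the exact label set and decoding logic inside jieba's paddle mode may differ.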

2 Preparation for Paddle mode

2.1 Installation of paddlepaddle library

To use paddle mode for word segmentation, first install the paddlepaddle library:

pip install paddlepaddle

If the installation is too slow, you can consider using domestic mirror sources:

pip install paddlepaddle -i https://pypi.tuna.tsinghua.edu.cn/simple

After the installation is complete, check the installation:

import paddle.fluid
paddle.fluid.install_check.run_check()

paddle.fluid.install_check.run_check() is an installation check function provided by the PaddlePaddle framework. Running it verifies that the current environment meets the requirements for deep learning development with Paddle. (Note that the fluid API is deprecated in PaddlePaddle 2.x; the equivalent check there is paddle.utils.run_check().)

2.2 Installation integrity check

The run_check() function checks whether the required dependencies are installed in the current environment, whether GPU acceleration is supported, whether PaddlePaddle's servers are reachable, and so on. If the check succeeds, it prints a confirmation; otherwise it prints specific error information to help you troubleshoot.
When the output is:

Running Verify Fluid Program …
Your Paddle Fluid works well on SINGLE GPU or CPU.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid now

the installation was successful.

This check may emit warnings such as:

UserWarning: Standalone executor is not used for data parallel
warnings.warn(
W0326 13:38:53.591773 13228 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 1.
These warnings are issued by the PaddlePaddle framework when it uses data-parallel training. Data parallelism means replicating the model on multiple computing devices, splitting each batch of training data among them, training on all devices simultaneously, and then aggregating the results. During data-parallel training, gradient information must be synchronized across devices to keep training correct and convergent.

This warning message actually has two parts:

  1. UserWarning: Standalone executor is not used for data parallel indicates that during data-parallel training the standalone executor should not be used; an executor compatible with data parallelism is used instead. An incompatible executor could produce incorrect training results or unexpected behavior.
  2. Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 1. indicates that two all_reduce operators (which synchronize gradient information across computing devices) were found during data-parallel training. To speed up training, the two operators are fused into one, reducing computation and communication overhead.

In practice, this warning can be ignored during data-parallel training; it does not affect the correctness or convergence of the results. For more details on data-parallel training and its optimization techniques, refer to PaddlePaddle's documentation and tutorials.

3 Code example using Paddle word segmentation

To enable paddle mode, pass use_paddle=True to jieba.cut():

import jieba  
import paddle  
  
paddle.enable_static()  
jieba.enable_paddle()  
  
text = '动嘴就能写代码,GitHub 将 ChatGPT 引入 IDE,重磅发布 Copilot X'
seg_1 = jieba.cut(text, cut_all=False)
seg_2 = jieba.cut(text, use_paddle=True)  
  
print('精 确 模 式:', "/".join(seg_1))  
print('paddle模式:', "/".join(seg_2))

Here, paddle.enable_static() enables static-graph mode, in which the program builds the computation graph ahead of time and can therefore execute more efficiently. jieba.enable_paddle() enables paddle mode.
Running this code prints the segmentation produced by each mode.
You can see that PaddlePaddle's deep learning model handles rarer terms such as "Copilot X" better; compare the two outputs yourself to judge the quality of the two modes.

Note that enabling paddle mode requires certain hardware and software support, such as a CPU that supports the AVX instruction set and a GPU with the cuDNN library installed. If enabling paddle mode fails, fall back to jieba's other segmentation modes. Also, because paddle mode must load a deep learning model, the program's memory usage may increase.
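Because enabling paddle mode can fail on unsupported machines, a graceful fallback is worth wiring in. A minimal sketch; `cut_with_fallback` and its two callables are hypothetical stand-ins for `jieba.cut(text, use_paddle=True)` and the default precise mode:

```python
import warnings

def cut_with_fallback(text, paddle_cut, precise_cut):
    """Try paddle-mode segmentation first; fall back to precise mode.

    `paddle_cut` and `precise_cut` are hypothetical callables, e.g.
    lambda t: jieba.cut(t, use_paddle=True) and
    lambda t: jieba.cut(t, cut_all=False).
    """
    try:
        return list(paddle_cut(text))
    except Exception as exc:
        warnings.warn(f"paddle mode unavailable ({exc}); using precise mode")
        return list(precise_cut(text))
```

In real use you would wrap jieba.enable_paddle() in a similar try/except, since that is the call most likely to fail on machines without AVX or without the paddlepaddle package installed.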


Origin blog.csdn.net/nkufang/article/details/129788741