[Intensive reading of papers] HugNLP: A Unified and Comprehensive Library for Natural Language Processing

Foreword

A general NLP task framework can lower the barrier to working on NLP tasks and give NLP researchers efficient, off-the-shelf solutions. This should further promote the development of the NLP field, and the work can fairly be called a milestone~


Abstract

HugNLP is a unified and comprehensive natural language processing library designed to let NLP researchers leverage off-the-shelf algorithms and develop new methods with user-defined models and tasks in real-world scenarios. Its structure consists of models, processors and applications, and it unifies the learning process of pre-trained models across different NLP tasks. The authors demonstrate the effectiveness of HugNLP through several featured NLP applications, such as general information extraction, low-resource mining, and code understanding and generation.

1. Introduction

Under the pre-train-then-fine-tune paradigm, pre-trained language models (PLMs) have become the infrastructure of many downstream NLP tasks. However, existing methods differ in architecture and model design, which makes them hard to get started with. HugNLP is therefore proposed as a unified and comprehensive open-source library that allows researchers to develop and evaluate NLP models effectively. Its backend is HuggingFace's Transformers library, and the training part integrates the MLflow tracking toolkit, making it easy to monitor experiment progress and record results. HugNLP consists of three parts: models, processors and applications.
The models part provides commonly used PLMs such as BERT, RoBERTa, DeBERTa, GPT-2 and T5. On top of these, the authors develop task-specific modules for pre-training and fine-tuning, and also provide prompt-based and parameter-efficient tuning techniques for PLMs, such as PET, P-tuning, Prefix-tuning and Adapter-tuning.
In the processor part, relevant data processing tools are developed for some commonly used datasets and task-specific corpora.
In the application part, KP-PLM is proposed, which realizes plug-and-play knowledge injection in model pre-training and fine-tuning by converting structured knowledge into unified language prompts. In addition, HugIE, a general information extraction toolkit, is developed through instruction fine-tuning and extractive modeling.
In summary, HugNLP has the following characteristics:

  1. HugNLP provides a series of pre-built components and modules that can be used to quickly develop and simplify the implementation of complex NLP models and tasks;
  2. HugNLP can be easily integrated into existing workflows to meet the individual needs of researchers;
  3. HugNLP implements some solutions for specific scenarios, such as KP-PLM and HugIE;
  4. Based on two widely used platforms, PyTorch and HuggingFace, it is easy to use.

2. Background

2.1 Pre-trained Language Models

The goal of a PLM is to learn semantic representations from unlabeled corpora through well-designed self-supervised tasks in the pre-training stage. PLM architectures can be divided into encoder-only, encoder-decoder and decoder-only families; however, these PLMs may lack background knowledge for some specific tasks, so knowledge-enhanced PLMs have been proposed to draw rich knowledge from external sources. Recent large models such as GPT-3 can also be applied in low-resource scenarios through prompts or instructions, and cross-task learning can be used to unify the semantic knowledge of different NLP tasks.
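To make these three families concrete, here is a minimal sketch (my own illustration, not from the paper) of how the HuggingFace Transformers backend that HugNLP builds on loads one representative public checkpoint per architecture:

```python
# Minimal sketch: one representative public checkpoint per PLM architecture family,
# loaded via HuggingFace Transformers (the backend HugNLP builds on).
from transformers import (
    AutoModelForMaskedLM,    # encoder-only, e.g. BERT / RoBERTa
    AutoModelForSeq2SeqLM,   # encoder-decoder, e.g. T5
    AutoModelForCausalLM,    # decoder-only, e.g. GPT-2
)

encoder_only = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
encoder_decoder = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
decoder_only = AutoModelForCausalLM.from_pretrained("gpt2")
```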

2.2 Fine-tuning for PLMs

In practical scenarios, the focus is on how to fine-tune PLMs so that their knowledge transfers to downstream tasks. HugNLP integrates a number of task-oriented fine-tuning methods and also implements popular tuning algorithms such as prompt-tuning and in-context learning. A minimal, generic example of task-oriented fine-tuning is sketched below.
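The following sketch uses plain HuggingFace Transformers (not HugNLP's own API); the toy dataset and hyperparameters are placeholders chosen only to keep the example self-contained:

```python
# Minimal sketch of task-oriented fine-tuning with the HuggingFace Trainer
# (generic Transformers code, not HugNLP's API).
import torch
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts, labels = ["a great library", "hard to use"], [1, 0]   # toy data
enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")

class ToyDataset(torch.utils.data.Dataset):
    def __len__(self):
        return len(labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in enc.items()}
        item["labels"] = torch.tensor(labels[i])
        return item

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1, per_device_train_batch_size=2),
    train_dataset=ToyDataset(),
)
trainer.train()
```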

3. HugNLP

3.1 Overview

[Figure: overall multi-layer architecture of HugNLP]
HugNLP is an open-source library with a multi-layer structure. The backend is the HuggingFace Transformers platform, which provides multiple transformer-based models. In addition, HugNLP integrates MLflow as a tracking callback for model training and for analyzing experiment results.

3.2 Library Architecture

3.2.1 Models

The models part provides popular transformer-based models and releases KP-PLM, a novel knowledge-enhanced pre-training model that injects factual knowledge purely through the prompt paradigm and can easily be applied to arbitrary PLMs. In addition, the authors implement task-specific models covering sequence classification, matching, tagging, span extraction, text generation, etc. For the few-shot learning setting, HugNLP provides a prototypical network for few-shot text classification and NER.
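As an illustration of the prototypical-network idea (my own sketch, not HugNLP's implementation): class prototypes are the mean embeddings of the support examples, and queries are classified by their distance to each prototype. The encoder is abstracted away and random vectors stand in for PLM embeddings:

```python
# Minimal prototypical-network sketch for few-shot classification.
# Embeddings would normally come from a PLM encoder (e.g. the [CLS] vector).
import torch

def prototypical_logits(support_emb, support_labels, query_emb, num_classes):
    """support_emb: [N, d], support_labels: [N], query_emb: [Q, d] -> logits [Q, C]"""
    prototypes = torch.stack([
        support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)
    ])                                          # [C, d] class prototypes
    dists = torch.cdist(query_emb, prototypes)  # [Q, C] Euclidean distances
    return -dists                               # closer prototype -> higher logit

# Toy usage with random embeddings standing in for PLM outputs.
support = torch.randn(6, 768)
support_labels = torch.tensor([0, 0, 1, 1, 2, 2])
queries = torch.randn(4, 768)
predictions = prototypical_logits(support, support_labels, queries, num_classes=3).argmax(dim=-1)
```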
[Figure: parameter freezing in HugNLP]
There are also some plug-and-play utilities incorporated into HugNLP.

  1. Parameter freezing. As shown in the figure above, training efficiency is improved by freezing some of the parameters in the PLM (a minimal sketch follows this list).
  2. Uncertainty estimation. Designed to estimate how certain the model is about its predictions in semi-supervised learning.
  3. Prediction calibration. Improves accuracy by calibrating the output distribution and mitigating semantic bias.
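The freezing utility can be approximated with a few lines of plain PyTorch/Transformers code (an illustration of the idea, not HugNLP's exact utility):

```python
# Minimal sketch of parameter freezing: freeze the PLM backbone and train only
# the classification head.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for name, param in model.named_parameters():
    if name.startswith("bert."):       # backbone parameters
        param.requires_grad = False    # only the classifier head stays trainable

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")
```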

3.2.2 Processors

HugNLP's processors are designed to load data into the pipeline and process task instances, including labeling data, sampling, and generating tensors. For different tasks, users need to define task-specific data collators, whose purpose is to convert raw examples into model-ready tensor features; a minimal illustration follows.
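The sketch below shows what such a collator boils down to (generic code, not HugNLP's own processor API); the field names "text" and "label" are placeholders:

```python
# Minimal sketch of a task-specific data collator: raw examples in, padded
# model-ready tensors out.
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def collate_fn(batch):
    enc = tokenizer([ex["text"] for ex in batch],
                    truncation=True, padding=True, return_tensors="pt")
    enc["labels"] = torch.tensor([ex["label"] for ex in batch])
    return enc

batch = [{"text": "HugNLP is handy", "label": 1},
         {"text": "the setup failed", "label": 0}]
tensors = collate_fn(batch)   # dict with input_ids, attention_mask, labels
```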

3.2.3 Applications

By configuring models and processors, the applications part provides users with rich modules for building real-world applications.

3.3 Core Capacities

The core parts of HugNLP are as follows:

  1. Knowledge-enhanced pre-training: conventional pre-training methods lack factual knowledge, so KP-PLM is proposed to enhance pre-training with knowledge. Specifically, a knowledge sub-graph is constructed for each input text via entity recognition and knowledge alignment, and the sub-graph is then decomposed into multiple relational paths that can be translated directly into language prompts.
  2. Prompt-based fine-tuning: aims to reuse the pre-training objective and predict label words through designed templates, which makes it well suited to low-resource scenarios.
  3. Instruction-tuning and in-context learning: enable few- and zero-shot learning without updating parameters, which also suits low-resource scenarios. The goal is to incorporate task-aware instructions that prompt GPT-style PLMs to generate reliable responses. Inspired by this, the authors extend the idea to other paradigms, such as extraction and inference.
  4. Uncertainty-aware self-training: self-training can pseudo-label unlabeled data to alleviate data sparsity, but standard self-training introduces too much noise and degrades model performance. Uncertainty-aware self-training first trains a teacher model on a small labeled set, then uses Monte Carlo dropout over a Bayesian neural network to approximate model certainty, and judiciously selects the unlabeled examples on which the teacher is more certain (a sketch of the Monte Carlo dropout step follows this list).
  5. Parameter-efficient learning: freezes the parameters of the backbone so that only a small number of parameters are tuned during training. The authors also implement parameter-efficient learning methods such as Prefix-tuning and Adapter-tuning.
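The Monte Carlo dropout step in item 4 can be sketched as follows (an illustration of the idea, not the library's code); `model` is assumed to be a classification PLM and `inputs` a tokenized batch:

```python
# Minimal sketch of Monte Carlo dropout for uncertainty estimation: keep dropout
# active at inference, run several stochastic forward passes, and use the
# variance of the predicted probabilities as an uncertainty score.
import torch

@torch.no_grad()
def mc_dropout_uncertainty(model, inputs, n_passes=10):
    model.train()   # keeps dropout layers stochastic during the forward passes
    probs = torch.stack([
        torch.softmax(model(**inputs).logits, dim=-1) for _ in range(n_passes)
    ])                                           # [n_passes, batch, num_labels]
    mean_probs = probs.mean(dim=0)               # averaged prediction
    uncertainty = probs.var(dim=0).sum(dim=-1)   # higher value -> less certain
    return mean_probs, uncertainty

# Pseudo-labels from the teacher would then be kept only for low-uncertainty examples.
```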

3.4 Featured Applications

  1. Benchmark fine-tuning. The authors develop training applications for popular benchmarks such as Chinese CLUE and GLUE, using fine-tuning (including prompt-tuning) methods to tune PLMs on these benchmarks. The code example is as follows:

[Figure: code example for benchmark fine-tuning]
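Since the original code example survives only as a screenshot, here is an illustrative PET-style prompt-tuning sketch instead (generic Transformers code, not HugNLP's application script): the input is wrapped in a template and label words are scored at the [MASK] position with a masked language model.

```python
# Illustrative prompt-based classification: score label words at the [MASK]
# position of a template using a masked language model.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
label_words = {"positive": "great", "negative": "terrible"}   # verbalizer

text = "The library saved me a lot of boilerplate."
prompt = f"{text} It was {tokenizer.mask_token}."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
scores = {label: logits[0, mask_pos, tokenizer.convert_tokens_to_ids(word)].item()
          for label, word in label_words.items()}
print(max(scores, key=scores.get))   # predicted label
```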

  2. General information extraction based on extractive instructions. The authors develop HugIE, a general information extraction toolkit built on HugNLP. Specifically, they collect multiple Chinese NER and event-extraction datasets, and then pre-train a general information extraction model using extractive instructions with global pointers (a rough illustration of the instruction-plus-extraction idea follows this list).
  3. PLM tuning for low-resource scenarios. This integrates prompt-based fine-tuning and uncertainty-aware self-training: the former makes full use of knowledge from the pre-training stage, while the latter augments the data.
  4. Code understanding and generation. Includes clone detection, code summarization, and defect detection.
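The sketch below only conveys the flavor of instruction-driven extraction by reusing a stock extractive question-answering pipeline; it is not HugIE's actual global-pointer model:

```python
# Rough illustration of the "extraction instruction" idea: phrase the target as
# an instruction/question and let an extractive model return the answer span.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
context = "HugNLP was released in 2023 and is built on HuggingFace Transformers."
instruction = "Find the release year mentioned in the text."
print(qa(question=instruction, context=context))   # extracted span plus a confidence score
```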

3.5 Development Workflow

[Figure: HugNLP development workflow]
The figure above shows how to start a new task with HugNLP in five steps: installation, data preparation, processor selection or design, model selection or design, and application design. HugNLP can thus simplify the implementation of complex NLP models.

4. Experimental Performances

4.1 Performance of Benchmarks

To verify the effectiveness of HugNLP for both fine-tuning and prompt-tuning, the Chinese CLUE and GLUE benchmarks are selected.
[Table: results of models of different scales on Chinese CLUE]
The table above shows the performance of models of different scales on Chinese CLUE. For GLUE, the proposed KP-PLM is evaluated under full-resource fine-tuning, few-shot prompt-tuning and zero-shot prompt-tuning, with RoBERTa chosen as the baseline, as shown in the table below.
[Table: full-resource, few-shot and zero-shot results of KP-PLM and RoBERTa]
The table above reflects the reliability of HugNLP in both full-resource and low-resource scenarios; its performance is comparable to that of other open-source frameworks and the original implementations.

4.2 Evaluation of Code-related Tasks

The authors use HugNLP to evaluate several code-related tasks, such as code clone detection, defect detection, and code translation. They fine-tune two widely used models, CodeT5 and PLBART, and compare them with competitive parameter-efficient learning methods; the table below demonstrates the effectiveness and efficiency of HugNLP.
[Table: results on code-related tasks with CodeT5 and PLBART]
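As a small aside illustrating this task family (not the paper's experimental setup), code summarization with CodeT5 can be sketched with the standard seq2seq API; the public Salesforce/codet5-base checkpoint is used only as a placeholder, since a summarization-fine-tuned checkpoint would be needed for meaningful output:

```python
# Illustrative code-summarization call with CodeT5 via the seq2seq API.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("Salesforce/codet5-base")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")
summary_ids = model.generate(**inputs, max_length=20)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```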

4.3 Effectiveness of Self-training

Finally, uncertainty-aware self-training is evaluated. With only 16 labeled samples, the performance is as follows:
[Table: self-training results with only 16 labeled samples]

5. Conclusion

HugNLP is a unified and comprehensive library built on PyTorch and HuggingFace. It has three important components (processors, models and applications) as well as multiple core capacities and utilities. The experimental results show that HugNLP can promote research and development in NLP.

Reading summary

This is an article about cutting-edge work on an NLP model tooling framework. The paper does not give concrete application walkthroughs, so I will try to deploy HugNLP myself afterwards; I do not yet know how demanding it is on hardware. The work is still inspiring to me, especially the ability of current large models to handle various tasks in a general way, such as general extraction and summary generation. These tasks can all be done with one model, but they normally require different code to be written, so the time and debugging costs are relatively high. With HugNLP, many common operations are packaged, and only the settings need to be changed, which removes a lot of time-consuming work and will greatly improve the efficiency of handling NLP tasks later on.
