Baidu Business AI Technology Innovation Competition Track 2: AIGC Inference Performance Optimization, Top 10 Experience Sharing

Friends, the AIGC performance optimization contest has ended, and many participants have already finished their defenses and collected their awards. I heard from insiders that the final code and results of the competition will not be released: the point of the competition is to attract the best code and save the company its own development cost, effectively outsourcing the work, so it will not be made public. In the spirit of open technical sharing, today I will share my experience of finishing in the top 10 of the semi-final, hoping it provides something useful to my fellow participants (personal account: I am your wolf brother).

First, here is the draft version of my competition project: "Text Generation: AIGC Inference Performance Optimization Competition - experience sharing from the 10th place in the preliminary and semi-final rounds", on Flying Paddle (PaddlePaddle) AI Studio.

This version omits a lot of content, because the original version contained many temporary files, test files, and personal code; what is released here is a stripped-down subset. Below I will first walk through the main contents so that it is easy to follow.

1. Method exploration 

To optimize model inference, the organizers provided some basic suggestions. At the start, everyone can simply follow the official recommendations and there will already be improvements.

(1) Adjusting hyperparameters, feasible

Adjusting hyperparameters is the fastest and most convenient method, but you have to be systematic about it. If you tweak values blindly, the results will just bounce up and down; you need a way to approach the limit. You can borrow the common grid-search approach from the internet. A simple example is in the project, in new/new.ipynb; a sketch of the idea is shown below.
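Since the notebook's code is not reproduced in this writeup, here is a minimal sketch of the idea only: loop over candidate decoding hyperparameters, time each full inference run, and keep the fastest combination. The predict() entry point, the parameter names, and the candidate values are assumptions for illustration, not the competition code; in practice you would also need to confirm that generation quality stays acceptable, not just the speed.

import itertools
import time

# Hypothetical helper that runs one full inference pass with the given
# decoding hyperparameters; replace with your own predict.py entry point.
from predict import predict

# Candidate values are assumptions for illustration only.
search_space = {
    "top_k": [4, 8, 16],
    "top_p": [0.7, 0.8, 0.9],
    "num_beams": [1, 2],
    "max_out_len": [64, 96, 128],
}

best_time, best_params = float("inf"), None
for values in itertools.product(*search_space.values()):
    params = dict(zip(search_space.keys(), values))
    start = time.time()
    predict(**params)              # one full inference pass
    elapsed = time.time() - start
    if elapsed < best_time:        # keep the fastest combination
        best_time, best_params = elapsed, params

print(f"best params: {best_params}, time: {best_time:.2f}s")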

The above code can be improved further on your own. I have a predict.py file: you can loop over the candidate values one by one, record the inference speed for each, and then hard-code the best parameters. As I recall, the official baseline is about 460s without any tuning; with this step alone you can get down to a bit over 200s, but pushing further than that is very difficult and requires other methods.

(2) Directly calling the compiled library, feasible

Adjusting hyperparameters is simple, but it hits a bottleneck and further optimization becomes difficult. At that point you need to dig through the PaddlePaddle source code; its inference path contains some optimization switches, for example the use_fast option described below.

In the official run_infer.py, if you add this option among the final arguments of model.generate, you will find that the speed suddenly jumps and you gain roughly another 100s. So fast!
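The exact snippet from run_infer.py is not reproduced here, so the following is a minimal sketch of what the change looks like. The surrounding arguments are placeholders; only the use_fast flag is the point (older PaddleNLP releases call it use_faster, and use_fp16_decoding is an optional extra), so check the names against your installed version.

# Minimal sketch; argument values are placeholders, only use_fast matters.
outputs, scores = model.generate(
    input_ids=input_ids,
    token_type_ids=token_type_ids,
    max_length=128,
    decode_strategy="beam_search",
    num_beams=2,
    length_penalty=1.2,
    use_fast=True,            # switch to the compiled FasterGeneration decoding path
    use_fp16_decoding=True,   # optional: fp16 decoding if the GPU supports it
)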

Soon you will run into another strange problem. After adding use_fast=True, inference is indeed fast, but the very first inference has to download and compile the library file for this module, which is really slow, at least 40-50s. If you read the official documentation, though, you will find that only the first inference is slow; after that it directly calls the already-compiled library file, which is very fast. So what should we do? It is easy to handle: we find the compiled library file ourselves and call it directly. The results prove this is entirely feasible and saves another 30-50s.

Someone asked how to find this .so library file. It is actually simple: run inference once the original way, and after it finishes it will automatically generate libdecoding_op.so; then just use find to search the filesystem for it. This is in fact an inference operator written in C++, which shows how much more efficient C++ is than Python here. There is also a point buried here that I will come back to later.
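The post does not show the loading code, so here is a minimal sketch of the idea under stated assumptions: locate the libdecoding_op.so left behind by the first (slow) JIT compilation and register it with Paddle before calling generate, so the recompilation step is skipped. The search root and the load_op_meta_info_and_register_op helper (from paddle.utils.cpp_extension.extension_utils) are assumptions based on how Paddle loads custom operators; verify the API against your Paddle/PaddleNLP version.

import glob

# Assumption: this helper exists in your Paddle version for registering a
# precompiled custom-op library; verify against your installed Paddle.
from paddle.utils.cpp_extension.extension_utils import (
    load_op_meta_info_and_register_op,
)

# Locate the library produced by the first (slow) JIT compilation.
# The search root is an assumption; the competition environment may differ.
candidates = glob.glob("/home/aistudio/**/libdecoding_op.so", recursive=True)
assert candidates, "libdecoding_op.so not found; run one inference with use_fast=True first"

# Register the compiled operators so generate(use_fast=True) skips recompilation.
load_op_meta_info_and_register_op(candidates[0])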

(3) Converting dynamic-graph inference to static-graph inference, not feasible

I estimate that 90% of people, on their first attempt, will think of converting the dynamic-graph inference code to static-graph inference. I tried the export several times but could not get it to work, so I gave up on this route.

 Below is the conversion code

# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
from pprint import pprint

import paddle

from paddlenlp.ops import FasterUNIMOText
from paddlenlp.transformers import UNIMOLMHeadModel, UNIMOTokenizer
from paddlenlp.utils.log import logger


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model_name_or_path",
        default="/home/aistudio/ad_generator/model_final",
        type=str,
        help="The model name or path of the UNIMOText model to export. ",
    )
    parser.add_argument(
        "--export_output_dir", default="./inference_model", type=str, help="Path to save the inference model of UNIMOText. "
    )
    parser.add_argument("--topk", default=80, type=int, help="The number of candidate to procedure top_k sampling. ")
    parser.add_argument(
        "--topp", default=0.8, type=float, help="The probability threshold to procedure top_p sampling. "
    )
    parser.add_argument("--max_out_len", default=128, type=int, help="Maximum output length. ")
    parser.add_argument("--min_out_len", default=6, type=int, help="Minimum output length. ")
    parser.add_argument("--num_return_sequence", default=1, type=int, help="The number of returned sequence. ")
    parser.add_argument("--temperature", default=0.8, type=float, help="The temperature to set. ")
    parser.add_argument("--num_return_sequences", default=2, type=int, help="The number of returned sequences. ")
    parser.add_argument("--use_fp16_decoding", action="store_true", help="Whether to use fp16 decoding to predict. ")
    parser.add_argument(
        "--decoding_strategy",
        default="beam_search",
        choices=["beam_search"],
        type=str,
        help="The main strategy to decode. ",
    )
    parser.add_argument("--num_beams", default=2, type=int, help="The number of candidate to procedure beam search. ")
    parser.add_argument(
        "--diversity_rate", default=0.0, type=float, help="The diversity rate to procedure beam search. "
    )
    parser.add_argument(
        "--length_penalty",
        default=1.2,
        type=float,
        help="The exponential penalty to the sequence length in the beam_search strategy. ",
    )

    args = parser.parse_args()
    return args


def do_predict(args):
    place = "gpu:0"
    place = paddle.set_device(place)

    model_name_or_path = args.model_name_or_path
    model = UNIMOLMHeadModel.from_pretrained(model_name_or_path)
    tokenizer = UNIMOTokenizer.from_pretrained(model_name_or_path)

    unimo_text = FasterUNIMOText(model=model, use_fp16_decoding=args.use_fp16_decoding, trans_out=True)

    # Set evaluate mode
    unimo_text.eval()

    # Convert dygraph model to static graph model
    unimo_text = paddle.jit.to_static(
        unimo_text,
        input_spec=[
            # input_ids
            paddle.static.InputSpec(shape=[None, None], dtype="int64"),
            # token_type_ids
            paddle.static.InputSpec(shape=[None, None], dtype="int64"),
            # attention_mask
            paddle.static.InputSpec(shape=[None, 1, None, None], dtype="float32"),
            # seq_len
            paddle.static.InputSpec(shape=[None], dtype="int64"),
            args.max_out_len,
            args.min_out_len,
            args.topk,
            args.topp,
            args.num_beams,  # num_beams. Used for beam_search.
            args.decoding_strategy,
            tokenizer.cls_token_id,  # cls/bos
            tokenizer.mask_token_id,  # mask/eos
            tokenizer.pad_token_id,  # pad
            args.diversity_rate,  # diversity rate. Used for beam search.
            args.temperature,
            args.num_return_sequences,
        ],
    )

    # Save converted static graph model
    paddle.jit.save(unimo_text, os.path.join(args.export_output_dir, "unimo_text"))
    logger.info("UNIMOText has been saved to {}.".format(args.export_output_dir))


if __name__ == "__main__":
    args = parse_args()
    pprint(args)

    do_predict(args)

(4) System parameter optimization, feasible

As I keep saying, read the PaddlePaddle source code and you will find plenty of surprises. It contains system-level tuning options, mainly for the GPU, so you can add a few "magic" lines along the lines of the sketch below.
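The original snippet is not reproduced in this writeup, so the following is only a minimal sketch of the kind of GPU-related flags Paddle exposes. The specific flags and values are assumptions, not the exact lines from the competition code; check which flags your Paddle version supports.

import paddle

# Assumption: illustrative GPU-related knobs, not the original post's snippet.
paddle.set_flags({
    # Let cuDNN exhaustively search for the fastest kernels.
    "FLAGS_cudnn_exhaustive_search": True,
    # Use the auto-growth allocator to reduce allocation overhead.
    "FLAGS_allocator_strategy": "auto_growth",
    # Pre-reserve most of the GPU memory up front.
    "FLAGS_fraction_of_gpu_memory_to_use": 0.9,
})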

With this, you can shave off another 1-3s, which is another small milestone.

(5) Rewriting the inference code entirely in C++, feasible but I couldn't do it

I buried a point back in (2) and said I would return to it: you could rewrite the entire inference path, i.e. model.generate, in C++, and that would definitely bring a big improvement. But I can't! I don't know how to write C++, so I could only poke at it privately. I asked an insider and this is definitely feasible, so try it yourself. It has nothing to do with the model; it is pure engineering work, essentially a translation exercise, so give it a go.

(6) TensorRT optimization, unknown

In fact, there is one more optimization route: TensorRT. I ran a small demo (there may still be some demo code left in the project), but the gain was not obvious, so I did not pursue it further. Its real effect is unknown; try it yourself.
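For reference, here is a minimal sketch of how TensorRT is typically enabled through the Paddle Inference API on an exported static-graph model (as in section (3)). The model/params paths, memory pool size, and precision settings are assumptions for illustration, not the competition code.

import paddle.inference as paddle_infer

# Paths are assumptions: an exported static-graph model as in section (3).
config = paddle_infer.Config(
    "./inference_model/unimo_text.pdmodel",
    "./inference_model/unimo_text.pdiparams",
)
config.enable_use_gpu(4096, 0)  # 4 GB initial GPU memory pool on device 0

# Hand TensorRT-compatible subgraphs over to TensorRT.
config.enable_tensorrt_engine(
    workspace_size=1 << 30,
    max_batch_size=1,
    min_subgraph_size=3,
    precision_mode=paddle_infer.PrecisionType.Half,  # fp16; use Float32 if unstable
    use_static=False,
    use_calib_mode=False,
)

predictor = paddle_infer.create_predictor(config)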

 2. Overall summary

The above covers all the attempts in this project; for the specific details you will need to run my code yourself. There are many failed attempts in there too: I even tried shared memory, multithreading, multiprocessing, and asynchronous processing, none of which worked out very well, though perhaps you will get improvements after trying them. The biggest gains came from hyperparameter optimization and calling the compiled .so directly.
