Deploying open source large language models for building chatbots on Amazon SageMaker

Open source large language models (LLMs) have become popular among researchers, developers, and organizations looking to foster innovation and experimentation. They encourage the open source community to collaborate on the development and improvement of LLMs. Open source LLMs provide transparency into the model architecture, training process, and training data, allowing researchers to understand how a model works, identify potential biases, and address ethical concerns. These open source LLMs democratize generative AI by making advanced natural language processing (NLP) technology available to a wide range of users to build mission-critical business applications. GPT-NeoX, LLaMA, Alpaca, GPT4All, Vicuna, Dolly, and OpenAssistant are some popular open source LLMs.

OpenChatKit is an open source LLM used to build general-purpose and specialized chatbot applications, released by Together Computer in March 2023 under the Apache-2.0 license. It gives developers more control over the chatbot's behavior so they can tailor it to their specific application. OpenChatKit provides a set of tools, base bots, and building blocks for building fully customized, powerful chatbots. The key components are as follows:

  • An instruction-tuned LLM, fine-tuned for chat from EleutherAI's GPT-NeoX-20B on over 43 million instructions using 100% carbon-negative compute. The GPT-NeoXT-Chat-Base-20B model is based on EleutherAI's GPT-NeoX model and fine-tuned on data focused on conversational interactions.

  • Custom recipes to fine-tune models to achieve high accuracy for tasks.

  • A scalable retrieval system that enables you to enhance bot responses at inference time using information from document repositories, APIs, or other sources of real-time updates.

  • A moderation model, fine-tuned from GPT-JT-6B, designed to filter which questions the bot responds to.

The increasing scale and size of deep learning models create barriers to deploying these models successfully in generative AI applications. To meet low-latency and high-throughput requirements, it becomes essential to adopt sophisticated methods such as model parallelism and quantization. Many users struggle to get started with large-model hosting for generative AI use cases because they lack proficiency in applying these methods.

In this article, we show how to deploy OpenChatKit models on Amazon SageMaker using DJL Serving and open source model parallel libraries such as DeepSpeed and Hugging Face Accelerate. DJL Serving is a high-performance, universal model serving solution powered by the Deep Java Library (DJL), which is programming-language agnostic. We demonstrate how the Hugging Face Accelerate library simplifies the deployment of large models across multiple GPUs, thereby easing the burden of running LLMs in a distributed manner.

Extensible retrieval system

The extensible retrieval system is one of the key components of OpenChatKit. It enables you to tailor the bot's responses based on a closed-domain knowledge base. Although LLMs can retain factual knowledge in their model parameters and can achieve good performance on downstream NLP tasks after fine-tuning, their ability to access and predict closed-domain knowledge accurately remains limited. Consequently, on knowledge-intensive tasks, their performance falls short of task-specific architectures. You can use the OpenChatKit retrieval system to augment the knowledge in responses with information from external knowledge sources such as Wikipedia, document repositories, APIs, and other information sources.

The retrieval system enables the chatbot to access up-to-date information by retrieving details relevant to a specific query, providing the context the model needs to generate answers. To illustrate the functionality of this retrieval system, we provide support for an index of Wikipedia articles and offer sample code that demonstrates how to call a web search API for information retrieval. By following the provided documentation, you can integrate the retrieval system with any dataset or API at inference time so that the chatbot can incorporate dynamically updated data in its responses.

Moderation model

Moderation is important in chatbot applications for content filtering, quality control, user safety, and legal and compliance reasons. Moderation is a difficult and subjective task that depends heavily on the domain in which the chatbot is used. OpenChatKit provides tools to moderate the chatbot application and monitor input text prompts for any inappropriate content. The moderation model provides a good baseline that can be adapted and customized for a variety of needs.

OpenChatKit has a 6-billion-parameter moderation model, GPT-JT-Moderation-6B, which can moderate the chatbot to limit inputs to moderated subjects. Although the model itself has some moderation built in, Together Computer trained the GPT-JT-Moderation-6B model with Ontocord.ai's OIG-moderation dataset. The model runs alongside the main chatbot to check that neither the user input nor the bot response contains inappropriate content. You can also use the model to detect out-of-domain questions asked of the chatbot and override them when the question is not part of the chatbot's domain.

Extensible retrieval system use cases

While this technology can be applied across various industries to build generative AI applications, this article discusses a use case from the financial industry. Retrieval augmented generation can be used in financial research to automatically generate research reports on specific companies, industries, or financial products. By retrieving relevant information from internal knowledge bases, financial archives, news reports, and research papers, you can generate comprehensive reports summarizing key insights, financial indicators, market trends, and investment recommendations. You can use this solution to monitor and analyze financial news, market sentiment, and trends.

 Solution overview

 The steps to build a chatbot using an OpenChatKit model and deploy this model to SageMaker are as follows:

  • Download the base chat model GPT-NeoXT-Chat-Base-20B, then package and upload the model artifacts to Amazon Simple Storage Service (Amazon S3).

  • Use the SageMaker Large Model Inference (LMI) container, configure properties, and set up custom inference code to deploy the model.

  • Configure model parallelism techniques and inference optimization libraries through DJL Serving properties. We use Hugging Face Accelerate as the engine for DJL Serving and define tensor parallel configurations to partition the model.

  • Create a SageMaker model and endpoint configuration, then deploy the SageMaker endpoint.

You can follow along by running the notebook in the GitHub repository.

 Download the OpenChatKit model

First, download the OpenChatKit base model. We use huggingface_hub and snapshot_download, which downloads an entire repository at a given revision. Downloads happen concurrently to speed up the process.
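The download step might look like the following minimal sketch; the repository ID and local directory are assumptions shown for illustration.

```python
from huggingface_hub import snapshot_download

# Downloads the full model repository at the given revision; files are fetched concurrently
local_path = snapshot_download(
    repo_id="togethercomputer/GPT-NeoXT-Chat-Base-20B",  # assumed Hugging Face Hub repo ID
    revision="main",                                     # pin a specific revision for reproducibility
    cache_dir="./openchatkit-model",                     # illustrative local directory
)
print(local_path)
```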

 DJL Serving properties

You can use the SageMaker LMI container to host large generative AI models without providing your own inference code. This is useful when there is no custom preprocessing of the input data or postprocessing of the model's predictions. You can also deploy a model using custom inference code. In this article, we demonstrate how to deploy OpenChatKit models with custom inference code.

SageMaker requires the model artifacts to be packaged in tar format. Each OpenChatKit model is created from the following files: serving.properties and model.py.
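Assuming a local directory containing serving.properties and model.py, packaging and uploading the artifacts could look like the sketch below; the bucket name and prefix are placeholders.

```python
import tarfile

import boto3

# Assumed local layout: ./gpt-neoxt-chat-base-20b/serving.properties and ./gpt-neoxt-chat-base-20b/model.py
with tarfile.open("gpt-neoxt-chat-base-20b.tar.gz", "w:gz") as tar:
    tar.add("gpt-neoxt-chat-base-20b", arcname=".")

s3 = boto3.client("s3")
s3.upload_file(
    "gpt-neoxt-chat-base-20b.tar.gz",
    "<your-bucket>",                               # placeholder bucket
    "openchatkit/gpt-neoxt-chat-base-20b.tar.gz",  # placeholder key
)
```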

 The serving.properties configuration file indicates to DJL Serving which model parallelization and inference optimization libraries to use. It contains the following parameters:

  • engine - the engine to be used by DJL.

  • option.entryPoint - The entry point Python file or module. This should align with the engine being used.

  • option.s3url – Set this parameter to the URI of the S3 bucket containing the model.

  • option.model_id - If you want to download a model from huggingface.co, you can set option.model_id to the model ID of a pretrained model hosted in the model repository on huggingface.co. The container uses this model ID to download the corresponding model repository from huggingface.co.

  • option.tensor_parallel_degree - Set this parameter to the number of GPU devices across which DeepSpeed needs to partition the model. This parameter also controls the number of workers per model started when DJL Serving runs. For example, if we have an 8-GPU machine and create eight partitions, there will be one worker per model to handle requests. It is necessary to tune the parallelism degree and identify the optimal value for a given model architecture and hardware platform. We call this capability inference-adapted parallelism.
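As a concrete illustration of these parameters, a minimal serving.properties might look like the following; the engine choice, parallel degree, and S3 path are illustrative assumptions, and the notebook in the repository contains the exact configuration used.

```properties
engine=Python
option.entryPoint=model.py
option.tensor_parallel_degree=4
# Point at the model artifacts in S3 (alternatively, use option.model_id for a Hub-hosted model)
option.s3url=s3://<your-bucket>/openchatkit/gpt-neoxt-chat-base-20b/
```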

 OpenChatKit model

The OpenChatKit base model implementation contains the following four files:

model.py - This file implements the handling logic for the main OpenChatKit GPT-NeoX model. It receives the inference input request, loads the model, loads the Wikipedia index, and serves the response. model.py uses the following key classes:

  • OpenChatKitService - This class handles passing data between the GPT-NeoX model, the Faiss search, and the conversation object. The WikipediaIndex and Conversation objects are initialized, and the input chat session is sent to the index to search for relevant content from Wikipedia. This class also generates a unique ID for each invocation if one is not supplied, for storing the prompts in Amazon DynamoDB.

  • ChatModel - This class loads the model and tokenizer and generates the response. It handles partitioning the model across multiple GPUs using tensor_parallel_degree and configures the dtypes and device_map. The prompts are passed to the model to generate the response. StopWordsCriteria is configured for the generation so that only the bot response is produced at inference time (see the sketch after this list).

  • ModerationModel - Two moderation models are used in the ModerationModel class: the input model, to indicate to the chat model that the input is inappropriate and the inference result should be overridden, and the output model, to override the inference result. Input prompts and output responses are classified with the following possible labels:

  • casual

  • needs caution

  • needs intervention (this is the one flagged as moderated by the model)

  • possibly needs caution

  • probably needs caution
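To make the ChatModel behavior more concrete, here is a hedged, minimal sketch of loading the chat model with Accelerate-style device placement and stopping generation at a conversation-turn marker. The model ID, prompt format, stop string, and generation parameters are illustrative assumptions, not the exact OpenChatKit implementation.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          StoppingCriteria, StoppingCriteriaList)

class StopWordsCriteria(StoppingCriteria):
    """Stop generation once the newly generated text contains a stop word."""

    def __init__(self, tokenizer, stop_words, prompt_length):
        self.tokenizer = tokenizer
        self.stop_words = stop_words
        self.prompt_length = prompt_length

    def __call__(self, input_ids, scores, **kwargs) -> bool:
        # Only inspect tokens generated after the prompt
        generated = self.tokenizer.decode(input_ids[0][self.prompt_length:])
        return any(stop in generated for stop in self.stop_words)

# Illustrative model ID; the notebook loads the packaged GPT-NeoXT-Chat-Base-20B artifacts instead
model_id = "togethercomputer/GPT-NeoXT-Chat-Base-20B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # reduce the memory footprint
    device_map="auto",          # let Accelerate spread layers across the available GPUs
)

prompt = "<human>: What is Amazon SageMaker?\n<bot>:"  # assumed chat turn format
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
stopping = StoppingCriteriaList(
    [StopWordsCriteria(tokenizer, ["<human>:"], inputs["input_ids"].shape[1])]
)
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=0.7,
    top_k=40,
    stopping_criteria=stopping,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```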

wikipedia_prepare.py - This file downloads and prepares the Wikipedia index. In this case, we use the Wikipedia index provided as a Hugging Face dataset. To search the Wikipedia documents for relevant text, the index has to be downloaded from Hugging Face because it is not packaged elsewhere. The wikipedia_prepare.py file handles the download when it is imported. Of the multiple processes running inference, only one clones the repository; the rest wait until the files are present in the local file system.

wikipedia.py – This file searches the Wikipedia index for contextually relevant documents. The input query is tokenized, and embeddings are created using mean_pooling. We compute the cosine similarity distance metric between the query embedding and the Wikipedia index to retrieve contextually relevant Wikipedia sentences.
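The following is a minimal sketch of the mean-pooling and cosine-similarity approach described above. The sentence-embedding model name is an illustrative assumption, and the index here is just an in-memory matrix rather than the Faiss-backed Wikipedia index used in the solution.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, ignoring padding positions
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# Illustrative embedding model; the OpenChatKit retrieval code defines its own
embedder_id = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(embedder_id)
embedder = AutoModel.from_pretrained(embedder_id)

def embed(texts):
    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = embedder(**encoded)
    return F.normalize(mean_pooling(output, encoded["attention_mask"]), p=2, dim=1)

# Toy "index": in the real solution this is the Faiss index over Wikipedia passages
passages = [
    "Amazon SageMaker is a managed machine learning service.",
    "The Nile is a river in northeastern Africa.",
]
index = embed(passages)
query = embed(["What is SageMaker?"])

# Cosine similarity reduces to a dot product on normalized embeddings
scores = query @ index.T
best = scores.argmax(dim=1).item()
print(passages[best])
```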

conversation.py - This file stores and retrieves the conversation thread in DynamoDB for passing to the model and the user. conversation.py is adapted from the open source OpenChatKit repository. It defines the object that stores the conversation turns between the human and the model. This way, the model maintains a session for the conversation, allowing the user to refer back to previous messages. Because SageMaker endpoint invocations are stateless, the conversation needs to be stored somewhere outside of the endpoint instances. On startup, the instance creates a DynamoDB table if it does not exist. All updates to the conversation are then stored in DynamoDB based on the session_id key, which is generated by the endpoint. Any invocation with a session ID retrieves the associated conversation thread and updates it as needed.
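As a rough illustration of this pattern, here is a hedged sketch of a DynamoDB-backed conversation store. The table name, key schema, and item layout are assumptions for illustration, not the exact schema used by conversation.py.

```python
import boto3

TABLE_NAME = "openchatkit-conversations"  # illustrative table name
dynamodb = boto3.resource("dynamodb")

def get_table():
    # Create the table on first use; subsequent calls just return it
    existing = [t.name for t in dynamodb.tables.all()]
    if TABLE_NAME not in existing:
        table = dynamodb.create_table(
            TableName=TABLE_NAME,
            KeySchema=[{"AttributeName": "session_id", "KeyType": "HASH"}],
            AttributeDefinitions=[{"AttributeName": "session_id", "AttributeType": "S"}],
            BillingMode="PAY_PER_REQUEST",
        )
        table.wait_until_exists()
        return table
    return dynamodb.Table(TABLE_NAME)

def load_turns(session_id):
    item = get_table().get_item(Key={"session_id": session_id}).get("Item")
    return item["turns"] if item else []

def save_turn(session_id, human, bot):
    turns = load_turns(session_id)
    turns.append({"human": human, "bot": bot})
    get_table().put_item(Item={"session_id": session_id, "turns": turns})
```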

 Build an LMI inference container with custom dependencies

Index searches use Facebook's Faiss library to perform the similarity search. Because this library is not included in the base LMI image, the container needs to be adapted to install it. The following code defines a Dockerfile that installs Faiss from source, along with the other libraries the bot endpoint needs. We use the sm-docker utility to build the image from Amazon SageMaker Studio and push it to Amazon Elastic Container Registry (Amazon ECR).

The DJL container does not have Conda installed, so Faiss needs to be cloned and compiled from source. To install Faiss, the dependencies for using the BLAS APIs and Python support need to be installed. After these packages are installed, Faiss is configured to use AVX2 and CUDA before being compiled with the Python extensions installed.

pandas, fastparquet, boto3, and git-lfs are installed afterwards because they are required to download and read the index files.
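A hedged sketch of such a Dockerfile is shown below. The base image tag, CUDA architecture, and package list are assumptions meant to illustrate the steps described above; adjust them for your Region, DJL version, and instance type.

```dockerfile
# Assumed LMI/DJL base image; pick the tag and Region that match your setup
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.22.1-deepspeed0.9.2-cu118

# Build tools plus BLAS and SWIG for the Faiss Python bindings
RUN apt-get update && \
    apt-get install -y git git-lfs wget cmake build-essential libopenblas-dev swig && \
    apt-get clean

# Clone Faiss and build it with AVX2 and CUDA support, then install the Python extension
RUN git clone https://github.com/facebookresearch/faiss.git /opt/faiss && \
    cd /opt/faiss && \
    cmake -B build . -DFAISS_OPT_LEVEL=avx2 -DFAISS_ENABLE_GPU=ON -DFAISS_ENABLE_PYTHON=ON \
          -DCMAKE_CUDA_ARCHITECTURES="86" && \
    make -C build -j faiss && \
    make -C build -j swigfaiss && \
    cd build/faiss/python && \
    python -m pip install .

# Libraries used to download and read the Wikipedia index files
RUN pip install pandas fastparquet boto3 && \
    git lfs install --skip-repo
```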

 Create model

Now that you have the Docker image in Amazon ECR, you can proceed to create the SageMaker model objects for the OpenChatKit models. We deploy the GPT-NeoXT-Chat-Base-20B chat model along with input and output moderation models based on GPT-JT-Moderation-6B.
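A hedged boto3 sketch of this step follows; the model name, image URI, role, and S3 path are placeholders, and the SageMaker Python SDK offers an equivalent higher-level API.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder values; substitute your own image URI, role, and artifact location
image_uri = "<account-id>.dkr.ecr.<region>.amazonaws.com/openchatkit-lmi:latest"
model_data = "s3://<your-bucket>/openchatkit/gpt-neoxt-chat-base-20b.tar.gz"
role_arn = "arn:aws:iam::<account-id>:role/<sagemaker-execution-role>"

sm.create_model(
    ModelName="gpt-neoxt-chat-base-20b",
    ExecutionRoleArn=role_arn,
    PrimaryContainer={
        "Image": image_uri,
        "ModelDataUrl": model_data,
    },
)
```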

 Configure endpoint

Next, we define the endpoint configuration for the OpenChatKit models. We deploy the models using the ml.g5.12xlarge instance type.
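A sketch of this configuration, reusing the sm client and placeholder names from the previous snippet:

```python
sm.create_endpoint_config(
    EndpointConfigName="openchatkit-config",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "gpt-neoxt-chat-base-20b",
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # Large models can take a while to load, so extend the startup health check timeout
            "ContainerStartupHealthCheckTimeoutInSeconds": 3600,
        }
    ],
)
```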

 Deploy endpoint

 Finally, create the endpoint using the model and endpoint configuration defined in the previous steps.
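Creating the endpoint and waiting for it to come in service might look like the following, with the same placeholder names and sm client as above:

```python
sm.create_endpoint(
    EndpointName="openchatkit-endpoint",
    EndpointConfigName="openchatkit-config",
)

# Block until the endpoint is InService (model download and loading can take several minutes)
waiter = sm.get_waiter("endpoint_in_service")
waiter.wait(EndpointName="openchatkit-endpoint")
```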

 Run inference from OpenChatKit models

Now it's time to send an inference request to the model and get a response. We pass the input text prompt and model parameters such as temperature, top_k, and max_new_tokens. The quality of the chatbot responses depends on the specified parameters, so it is recommended to benchmark model performance against these parameters to find the optimal setting for your use case. The input prompt is first sent to the input moderation model, and then the output is sent to the ChatModel to generate a response. In this step, the model uses the Wikipedia index to retrieve contextually relevant sections as part of the prompt in order to get domain-specific responses from the model. Finally, the model response is sent to the output moderation model to check the classification, and then the response is returned.
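A hedged invocation sketch follows; the exact payload schema depends on how model.py parses the request, so the field names and prompt format here are illustrative assumptions.

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "<human>: What is the capital of France?\n<bot>:",  # assumed prompt format
    "parameters": {"temperature": 0.7, "top_k": 40, "max_new_tokens": 128},
    "session_id": "demo-session-001",  # reuse to keep conversation history in DynamoDB
}

response = runtime.invoke_endpoint(
    EndpointName="openchatkit-endpoint",
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(response["Body"].read().decode("utf-8"))
```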

Clean up

 Follow the instructions in the Cleanup section to remove resources provisioned as part of this article to avoid unnecessary charges.
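For reference, deleting the resources with boto3 might look like the following sketch, using the same placeholder names and sm client as above:

```python
sm.delete_endpoint(EndpointName="openchatkit-endpoint")
sm.delete_endpoint_config(EndpointConfigName="openchatkit-config")
sm.delete_model(ModelName="gpt-neoxt-chat-base-20b")
```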

Summary

In this article, we discussed the importance of open source LLMs and how to deploy OpenChatKit models on SageMaker to build next-generation chatbot applications. We covered the various components of OpenChatKit, the moderation models, and how to use an external knowledge source such as Wikipedia in a Retrieval Augmented Generation (RAG) workflow. You can find step-by-step instructions in the GitHub notebook.
