Amazon Cloud Technology: Amazon SageMaker launches new features for powerful, efficient, and more cost-effective generative AI services

The world of artificial intelligence (AI) and machine learning (ML) is witnessing a paradigm shift with the rise of generative AI models that can create human-like text, images, code, and audio. Generative AI models are much larger and more complex than traditional machine learning models, and that scale and complexity bring high inference costs and a growing need for powerful computing resources. For enterprises and researchers with limited resources, the high inference cost of generative AI models can become a barrier to entry, making more efficient and cost-effective solutions necessary. In addition, most generative AI use cases involve human interaction or real-world scenarios, which require hardware that can deliver low-latency performance. Amazon Cloud Technology has been innovating with purpose-built chips to meet the growing demand for powerful, efficient, and affordable computing hardware.

Recently, Amazon Cloud Technology announced that Amazon SageMaker supports instances based on AWS Inferentia2 (ml.inf2) and AWS Trainium (ml.trn1) for hosting generative AI models for real-time and asynchronous inference. ml.inf2 instances are available for SageMaker model deployment in US East (Ohio), and ml.trn1 instances in US East (N. Virginia).

These instances are now available on SageMaker for low-cost, high-performance deployment of generative AI models, including large language models (LLMs), Stable Diffusion, and Vision Transformers. In addition, Amazon SageMaker Inference Recommender can help you run load tests and evaluate the cost-effectiveness of deploying models on these instances. You can use ml.inf2 and ml.trn1 instances to run machine learning applications on SageMaker for text summarization, code generation, video and image generation, speech recognition, personalization, fraud detection, and more. To get started, first specify an ml.trn1 or ml.inf2 instance type when configuring your SageMaker endpoint, then use an ml.trn1- and ml.inf2-compatible AWS Deep Learning Container (DLC) for PyTorch, TensorFlow, Hugging Face, or Large Model Inference (LMI), as sketched below.
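As an illustration of that flow, here is a minimal sketch (not taken from the article's notebook) that retrieves an LMI DLC image and deploys a packaged model to an ml.inf2 endpoint with the SageMaker Python SDK. The S3 path, IAM role, and container version are placeholder assumptions, and the "djl-neuronx" image key assumes a SageMaker SDK release that knows about the LMI NeuronX images.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.model import Model

session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/MySageMakerRole"  # placeholder execution role

# Retrieve an LMI (DJL NeuronX) DLC compatible with ml.inf2/ml.trn1 instances.
image_uri = image_uris.retrieve(
    framework="djl-neuronx",
    region=session.boto_region_name,
    version="0.24.0",  # assumed container version; check the current DLC release list
)

model = Model(
    image_uri=image_uri,
    model_data="s3://my-bucket/gpt4all-j/model.tar.gz",  # placeholder artifact containing serving.properties
    role=role,
    sagemaker_session=session,
)

# The instance_type is what selects the AWS Inferentia2 hardware for the endpoint.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.24xlarge",
    container_startup_health_check_timeout=900,  # large models can take a while to load and compile
)
```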

This article demonstrates how to deploy a large language model on AWS Inferentia2 using SageMaker and an LMI container, without any additional coding. We use GPT4ALL-J, a fine-tuned GPT-J 6B model that provides chatbot-style interaction.

ml.trn1 and ml.inf2 instances overview

ml.trn1 instances are powered by Trainium accelerators, which are built primarily for high-performance deep learning training of generative AI models, including LLMs. However, these instances can also serve inference workloads for models even larger than those that fit on Inf2. The largest instance, trn1.32xlarge, is equipped with 16 Trainium accelerators and 512 GB of accelerator memory in a single instance, providing up to 3.4 petaflops of FP16/BF16 compute. The 16 Trainium accelerators are connected with ultra-fast NeuronLink-v2, which simplifies collective communication.

ml.inf2 instances are powered by the AWS Inferentia2 accelerator, which is purpose-built for inference. Compared with first-generation AWS Inferentia, it delivers three times the compute performance, four times the throughput, and up to 10 times lower latency. The largest instance, inf2.48xlarge, is equipped with 12 AWS Inferentia2 accelerators and 384 GB of accelerator memory in a single instance, with a combined compute power of 2.3 petaflops for BF16/FP16. This lets you deploy models with up to 175 billion parameters on a single instance. For very large models that do not fit on a single accelerator, data flows directly between accelerators over NeuronLink, bypassing the CPU entirely. Inf2 is the only inference-optimized instance family that offers this interconnect, a feature otherwise available only on more expensive training instances. With NeuronLink, Inf2 supports faster distributed inference, improving throughput and reducing latency.

Both AWS Inferentia2 and Trainium accelerators have two NeuronCore-v2 cores, 32 GB of HBM memory stacks, and dedicated collective-compute engines that automatically optimize the runtime by overlapping computation and communication during multi-accelerator inference.

AWS Neuron SDK

AWS Neuron is the SDK for running deep learning workloads on AWS Inferentia- and Trainium-based instances. It includes a deep learning compiler, a runtime, and tools that are natively integrated with TensorFlow and PyTorch. With Neuron, you can develop, profile, and deploy high-performance machine learning workloads on ml.trn1 and ml.inf2 instances.

The Neuron compiler accepts machine learning models in various formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. The compiler is invoked from within the machine learning framework, where the model is sent to it by the Neuron framework plugin. The resulting compiler artifact is a NEFF (Neuron Executable File Format) file, which the Neuron runtime loads onto the Neuron device.
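To make that flow concrete, here is a minimal, hedged sketch of compiling a toy PyTorch model through the Neuron framework plugin (torch-neuronx). It assumes an Inf2 or Trn1 instance with the Neuron SDK installed; the toy network is only a stand-in for a real model.

```python
import torch
import torch_neuronx  # ships with the AWS Neuron SDK

# Toy network standing in for a real model.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example_input = torch.rand(1, 128)

# torch_neuronx.trace sends the model through the Neuron compiler and returns a
# TorchScript module whose forward pass runs on NeuronCores.
neuron_model = torch_neuronx.trace(model, example_input)

# The compiled artifact can be saved and later loaded with torch.jit.load on a Neuron device.
torch.jit.save(neuron_model, "linear_neuron.pt")
```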

The Neuron runtime consists of a kernel driver and C/C++ libraries that provide APIs for accessing AWS Inferentia and Trainium Neuron devices. The Neuron framework plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on NeuronCores. The runtime loads compiled deep learning models (NEFF files) onto Neuron devices and is optimized for high throughput and low latency.

Hosting NLP models using a SageMaker ml.inf2 instance

transformers-neuronx is an open-source library that shards a model's large weight matrices across multiple NeuronCores. Before diving into how to use this library to serve LLMs, let's briefly review the typical deployment flow for a model that fits on a single NeuronCore.

First, check the list of supported models to make sure AWS Inferentia2 supports your model. Next, precompile the model with the Neuron compiler; compilation can be done from a SageMaker notebook or an Amazon Elastic Compute Cloud (Amazon EC2) instance. You can then deploy the model with a popular deep learning framework such as PyTorch using the SageMaker Python SDK. The model is deployed to a SageMaker hosting service, which gives you an endpoint for inference. These endpoints are fully managed and support autoscaling; a minimal deployment sketch follows.
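The sketch below illustrates those steps with the SageMaker Python SDK. The image URI, S3 artifact, IAM role, and inference.py entry point are placeholders rather than values from the article.

```python
from sagemaker.pytorch import PyTorchModel

# Placeholder URI for a Neuron-compatible PyTorch inference DLC; look up the exact
# image for your Region and Neuron SDK version in the AWS DLC list.
neuron_dlc_image = "<pytorch-neuronx-inference-dlc-uri>"

pytorch_model = PyTorchModel(
    model_data="s3://my-bucket/compiled/model.tar.gz",   # contains the Neuron-compiled artifact
    role="arn:aws:iam::111122223333:role/MySageMakerRole",
    entry_point="inference.py",                          # custom handler that loads the compiled model
    image_uri=neuron_dlc_image,
)

# Deploy to a fully managed SageMaker endpoint backed by an Inferentia2 instance.
predictor = pytorch_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
)
```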

Hosting LLM using a SageMaker ml.inf2 instance

Large language models often have billions of parameters and are too large to fit on a single accelerator, which requires model parallelism techniques to host an LLM across multiple accelerators. Another key requirement for hosting LLMs is a high-performance model serving solution that can efficiently load the model, manage partitioning, and seamlessly serve requests over HTTP endpoints.

SageMaker includes specialized Deep Learning Containers (DLCs), libraries, and tools for model parallelism and large model inference. SageMaker maintains DLCs with popular open-source libraries for hosting large models such as GPT, T5, OPT, BLOOM, and Stable Diffusion on Amazon Cloud infrastructure. These specialized DLCs are called SageMaker LMI containers.

SageMaker LMI containers use DJLServing, a model server integrated with the transformers-neuronx library, to support tensor parallelism across NeuronCores. The DJL model server and the transformers-neuronx library are the core components of the container, which also includes the Neuron SDK. This setup loads the model onto AWS Inferentia2 accelerators, runs it in parallel across multiple NeuronCores, and serves it via HTTP endpoints.

LMI containers support loading models from an Amazon Simple Storage Service (Amazon S3) bucket or from the Hugging Face Hub. The default handler script downloads the model, compiles and converts it into a Neuron-optimized format, and then loads it for serving. To host an LLM with an LMI container, you have two options:

No-code (preferred) - This is the easiest way to deploy an LLM with LMI containers. In this approach, you use the provided default handler and simply pass the model name and the required parameters in a serving.properties file to load and host the model. To use the default handler, set the entryPoint parameter to djl_python.transformers-neuronx.

Bring your own script - In this approach, you create your own model.py file containing the code required to load and serve the model. This file acts as an intermediary between the DJLServing APIs and the transformers-neuronx APIs. To customize the model loading process, you can supply configurable parameters through serving.properties. A skeleton of such a handler is sketched below.
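For the second option, a skeletal model.py might look like the following. The handle() signature follows DJLServing's Python engine convention; the model-loading body is a labeled placeholder rather than a verified transformers-neuronx call sequence.

```python
from djl_python import Input, Output

model = None  # loaded lazily on the first request


def load_model(properties):
    # In a real handler, load and shard the model across NeuronCores with
    # transformers-neuronx here, using options passed via serving.properties.
    tp_degree = int(properties.get("tensor_parallel_degree", "1"))
    # Placeholder "model": echoes the prompt so the skeleton stays self-contained.
    return lambda prompt: f"(tp_degree={tp_degree}) echo: {prompt}"


def handle(inputs: Input) -> Output:
    global model
    if model is None:
        model = load_model(inputs.get_properties())

    if inputs.is_empty():
        # Warm-up/ping request from the model server; nothing to return.
        return None

    prompt = inputs.get_as_json()["text"]
    generated = model(prompt)
    return Output().add_as_json({"generated_text": generated})
```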

Runtime system architecture

The tensor_parallel_degree property determines how tensor-parallel modules are distributed across NeuronCores. For example, inf2.24xlarge has six AWS Inferentia2 accelerators, each with two NeuronCores, and each NeuronCore has 16 GB of dedicated high-bandwidth memory (HBM) to store the tensor-parallel modules. With a tensor parallel degree of 4, LMI allocates three copies of the model, each using four NeuronCores. When the LMI container starts, the model is first loaded and traced in CPU-addressable memory; once tracing is complete, the model is partitioned across the NeuronCores according to the tensor parallel degree.

LMI uses DJLServing as its model serving stack. After the container passes SageMaker's health check, it can serve inference requests. DJLServing starts a number of Python processes equal to TOTAL_NUMBER_OF_NEURON_CORES / TENSOR_PARALLEL_DEGREE. Each Python process contains TENSOR_PARALLEL_DEGREE C++ threads, and each C++ thread holds one model shard on one NeuronCore.
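As a quick worked example of this arithmetic, assuming an inf2.24xlarge and a tensor parallel degree of 4 as in the article:

```python
# inf2.24xlarge: 6 AWS Inferentia2 accelerators x 2 NeuronCores each.
total_neuron_cores = 6 * 2          # 12 NeuronCores
tensor_parallel_degree = 4

python_processes = total_neuron_cores // tensor_parallel_degree  # model copies / workers -> 3
threads_per_process = tensor_parallel_degree                     # one shard per NeuronCore -> 4

print(python_processes, threads_per_process)  # 3 4
```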

When the server is called with multiple independent requests, each worker (Python process) tends to run inference sequentially. Although this is easier to set up, it usually does not make the best use of the accelerators' compute power. To address this, DJLServing provides built-in dynamic batching, which merges independent inference requests into a larger batch on the server side to improve throughput. All requests pass through the dynamic batcher before entering the actual job queue for inference. You can set your preferred batch size for dynamic batching with the batch_size setting in serving.properties, and configure max_batch_delay to specify the maximum time the batcher waits for additional requests to join a batch, based on your latency requirements. Throughput also depends on the number of model copies and the group of Python processes started in the container. When tensor parallelism is set to 4, the LMI container starts three Python process groups, each holding a complete copy of the model. This allows you to increase the batch size and achieve higher throughput.

SageMaker notebook for deploying LLM

In this section, we walk through deploying GPT4All-J, a 6-billion-parameter model that takes about 24 GB in FP32. GPT4All-J is a popular chatbot trained on a wide variety of interaction content, including word problems, dialogues, code, poems, songs, and stories. GPT4All-J is a fine-tuned GPT-J model that produces responses similar to human interaction.

A complete example notebook is available on GitHub. The model can be deployed to an Inf2 instance using the SageMaker Python SDK. We use the provided default handler to load the model, so we only need to supply a serving.properties file. This file contains the configuration the DJL model server needs to download and host the model. We can specify the name of a Hugging Face model with the model_id parameter to download the model directly from the Hugging Face repository, or download the model from Amazon S3 by providing the s3url parameter. The entryPoint parameter is configured to point to the library used to load the model.

The tensor_parallel_degree property determines how tensor-parallel modules are distributed across devices. For example, with 12 NeuronCores and a tensor parallel degree of 4, LMI allocates three copies of the model, each using four NeuronCores. You can also define the precision type with the dtype property. The n_positions parameter defines the maximum sum of the model's input and output sequence lengths.
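A hedged sketch of such a serving.properties file, written from a notebook cell, is shown below. The parameter names follow the article, but the entryPoint value and option prefixes can differ between LMI container versions, and the model ID, dtype, and sequence length are illustrative assumptions.

```python
# Write the serving.properties configuration used by the DJL model server.
serving_properties = """\
engine=Python
option.entryPoint=djl_python.transformers-neuronx
option.model_id=nomic-ai/gpt4all-j
option.tensor_parallel_degree=4
option.dtype=fp16
option.n_positions=512
# Dynamic batching knobs discussed earlier (values are arbitrary examples):
batch_size=4
max_batch_delay=100
"""

with open("serving.properties", "w") as f:
    f.write(serving_properties)
```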

Summary

In summary, this article introduced the new SageMaker capability to host generative AI models on ml.inf2 and ml.trn1 instances, and demonstrated how to deploy the generative AI model GPT4ALL-J on AWS Inferentia2 using SageMaker and LMI containers without writing any code. It also showed how to load, partition, and serve models with DJLServing and transformers-neuronx.
