Amazon Cloud Technology infrastructure provides technical support for large-scale model inference

At re:Invent 2019, Amazon Cloud Technology released the Inferentia chip and Inf1 instances as new infrastructure. Inferentia is a high-performance machine learning inference chip custom-designed by Amazon Cloud Technology to deliver cost-effective, low-latency predictions at scale. Four years later, in April 2023, Amazon Cloud Technology released the Inferentia2 chip and Inf2 instances, aimed at providing technical support for large-scale model inference.

Inf2 instances offer up to 2.3 petaflops of deep learning (DL) performance and up to 384 GB of total accelerator memory with 9.8 TB/s of bandwidth. The Amazon Cloud Technology Neuron SDK integrates natively with popular machine learning frameworks such as PyTorch and TensorFlow, so users can keep their existing frameworks and application code when deploying on Inf2. Developers can use Inf2 instances through AWS Deep Learning AMIs, AWS Deep Learning Containers, or managed services such as Amazon ECS, Amazon EKS, and Amazon SageMaker.
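
As a rough illustration of that workflow, here is a minimal sketch (not official AWS sample code) of compiling a small PyTorch model for Inf2 with the torch-neuronx plugin; the model and input shape are arbitrary placeholders.

```python
import torch
import torch_neuronx  # Neuron plugin for PyTorch, pre-installed in the Deep Learning AMIs/containers

# A placeholder model; any traceable torch.nn.Module is handled the same way.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

example_input = torch.rand(1, 128)

# torch_neuronx.trace invokes the Neuron compiler and returns a TorchScript
# module whose forward pass executes on the NeuronCores of the Inf2 instance.
neuron_model = torch_neuronx.trace(model, example_input)

# Inference keeps the ordinary PyTorch calling convention.
with torch.no_grad():
    logits = neuron_model(example_input)
print(logits.shape)  # torch.Size([1, 10])
```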

At the core of Amazon EC2 Inf2 instances are Amazon Cloud Technology Inferentia2 devices, each of which contains two NeuronCore-v2 cores. Each NeuronCore-v2 is an independent heterogeneous compute unit with four main engines: Tensor, Vector, Scalar, and GPSIMD. The Tensor engine is optimized for matrix operations. The Scalar engine is optimized for element-wise operations such as the ReLU (rectified linear unit) function. The Vector engine is optimized for non-element-wise vector operations, including batch normalization and pooling.

Amazon Cloud Technology Inferentia2 supports multiple data types, including FP32, TF32, BF16, FP16, and UINT8, so users can choose the most appropriate data type for their workload. It also supports the new configurable FP8 (cFP8) data type, which is particularly relevant for large models because it reduces a model's memory footprint and I/O requirements.
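
As a sketch of how a data type could be chosen at compile time, torch_neuronx.trace can forward options to the Neuron compiler through compiler_args. The auto-cast flag names below follow the neuronx-cc options but may differ between SDK releases, so treat them as an assumption to verify against the Neuron documentation.

```python
import torch
import torch_neuronx

model = torch.nn.Linear(512, 512).eval()
example_input = torch.rand(1, 512)

# Ask the compiler to cast matrix-multiply operands to BF16 while leaving the
# rest of the graph in FP32. Flag names are assumed from neuronx-cc's options
# and should be checked for the installed SDK version.
neuron_model = torch_neuronx.trace(
    model,
    example_input,
    compiler_args=["--auto-cast", "matmult", "--auto-cast-type", "bf16"],
)
```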

Amazon Cloud Technology Inferentia2 embeds a general-purpose digital signal processor (DSP) that enables dynamic execution, so control-flow operators do not need to be unrolled or executed on the host. Amazon Cloud Technology Inferentia2 also supports dynamic input shapes, which is critical for models whose input tensor sizes are not known in advance (such as models that process text).

Amazon Cloud Technology Inferentia2 supports custom operators written in C++. Neuron Custom C++ Operators enable users to write C++ custom operators that run natively on NeuronCores. Using the standard PyTorch custom-operator programming interface, users can migrate CPU custom operators to Neuron and implement new experimental operators, all without in-depth knowledge of the NeuronCore hardware.

Inf2 instances are the first inference-optimized instances on Amazon EC2 to support distributed inference through a direct, ultra-high-speed connection between chips (NeuronLink v2). NeuronLink v2 uses collective-communication operators such as all-reduce to run high-performance inference pipelines across all chips.
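
The collective-communication idea behind this can be sketched in plain PyTorch, independent of the Neuron API: when a linear layer's weights are sharded across devices, each device computes a partial result and an all-reduce sums them into the full output. This is a conceptual illustration only, run here as CPU processes rather than on NeuronCores.

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    # Each process stands in for one chip holding a shard of the weights.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    torch.manual_seed(0)               # identical input and weights on every rank
    x = torch.rand(1, 8)
    full_weight = torch.rand(8, 4)

    # Shard the contraction dimension: each rank multiplies its slice of the
    # input by its slice of the weight, producing a partial output.
    shard = full_weight[rank * 4:(rank + 1) * 4, :]
    partial = x[:, rank * 4:(rank + 1) * 4] @ shard

    # all-reduce sums the partial results, so every rank ends up with y = x @ W.
    dist.all_reduce(partial, op=dist.ReduceOp.SUM)
    if rank == 0:
        print(torch.allclose(partial, x @ full_weight))  # True

    dist.destroy_process_group()

if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)
```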

 

Neuron SDK

Amazon Cloud Technology Neuron is an SDK that optimizes the performance of complex neural network models executed on Amazon Cloud Technology Inferentia and Trainium. It includes a deep learning compiler, runtime, and tools that are natively integrated with popular frameworks such as TensorFlow and PyTorch, and it comes pre-installed in the Deep Learning AMIs and Deep Learning Containers, so customers can quickly start running high-performance, cost-effective inference.

The Neuron compiler accepts machine learning models in multiple formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. The Neuron compiler is invoked from within the machine learning framework: the Neuron framework plugin sends the model to the compiler. The resulting compiler artifact is a NEFF file (Neuron Executable File Format), which is in turn loaded onto the Neuron device by the Neuron runtime.
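
Assuming the usual torch-neuronx flow, where the compiled NEFF travels inside the serialized TorchScript module, compilation can happen once and the artifact can then be shipped to the serving fleet; a minimal sketch (file name is arbitrary):

```python
import torch
import torch_neuronx

model = torch.nn.Linear(128, 10).eval()
example_input = torch.rand(1, 128)

# Compilation happens here; the returned object is a TorchScript module that
# carries the compiled NEFF along when it is serialized.
neuron_model = torch_neuronx.trace(model, example_input)
torch.jit.save(neuron_model, "model_neuron.pt")

# On the inference host, loading the artifact lets the Neuron runtime place the
# embedded NEFF onto a NeuronCore; no recompilation is needed.
restored = torch.jit.load("model_neuron.pt")
output = restored(example_input)
```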

The Neuron runtime consists of a kernel driver and a C/C++ library that provides APIs to access Inferentia and Trainium Neuron devices. The Neuron framework plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on NeuronCores. The Neuron runtime loads compiled deep learning models (NEFF files) onto Neuron devices and is optimized for high throughput and low latency.
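
The runtime's view of the hardware can also be narrowed per process with environment variables such as NEURON_RT_NUM_CORES (the name follows the Neuron runtime configuration documentation; verify it for the installed release), set before any Neuron library loads a model:

```python
import os

# Restrict this process to a single NeuronCore. The variable name is assumed
# from the Neuron runtime configuration docs and should be verified for the
# SDK release in use; it must be set before the model is loaded.
os.environ["NEURON_RT_NUM_CORES"] = "1"

import torch
restored = torch.jit.load("model_neuron.pt")  # artifact from the previous sketch
```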
