Amazon Cloud Technology Inferentia infrastructure: providing large-scale, low-latency inference

At Amazon Cloud Technology re:Invent 2019, Amazon Cloud Technology released its new Inferentia chip and the Inf1 instance family. Inferentia is a high-performance machine learning inference chip, custom-designed by Amazon Cloud Technology, whose purpose is to provide cost-effective, large-scale, low-latency prediction. Four years later, in April 2023, Amazon Cloud Technology released the Inferentia2 chip and Inf2 instances, aimed at providing the technical foundation for large-model inference.

Inf2 instances offer up to 2.3 petaflops of DL performance, with up to 384 GB of total accelerator memory and 9.8 TB/s of memory bandwidth. The Amazon Cloud Technology Neuron SDK is natively integrated with popular machine learning frameworks such as PyTorch and TensorFlow, so users can continue to use their existing framework and application code when deploying on Inf2. Developers can use Inf2 instances through AWS Deep Learning AMIs, AWS Deep Learning Containers, or managed services such as Amazon ECS, Amazon EKS, and Amazon SageMaker.
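As a minimal sketch of that workflow (assuming the torch-neuronx package is installed on an Inf2 instance; the model and file names here are purely illustrative), an ordinary PyTorch model can be traced for Inferentia2 and then used like a regular TorchScript module:

```python
import torch
import torch.nn as nn
import torch_neuronx

# A small example network; any torch.nn.Module works the same way.
class SmallNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(128, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

model = SmallNet().eval()
example_input = torch.rand(1, 128)

# Compile the model for Inferentia2; the result is a TorchScript module
# whose forward pass executes on the NeuronCores.
neuron_model = torch_neuronx.trace(model, example_input)

# Save the compiled artifact and reload it later without recompiling.
torch.jit.save(neuron_model, "smallnet_neuron.pt")
restored = torch.jit.load("smallnet_neuron.pt")

# Inference looks exactly like a normal PyTorch forward pass.
print(restored(example_input).shape)
```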


At the heart of Amazon EC2 Inf2 instances are Amazon Cloud Technology Inferentia2 devices, each containing two NeuronCore-v2 cores. Each NeuronCore-v2 is an independent heterogeneous compute unit with four main engines: a tensor engine, a vector engine, a scalar engine, and a GPSIMD engine. The tensor engine is optimized for matrix operations. The scalar engine is optimized for element-wise operations such as the ReLU (rectified linear unit) function. The vector engine is optimized for non-element-wise vector operations, including batch normalization and pooling.

Amazon Cloud Technology Inferentia2 supports multiple data types, including FP32, TF32, BF16, FP16 and UINT8, so users can choose the most suitable data type according to the workload. It also supports the new Configurable FP8 (cFP8) data type, which is particularly relevant for large models as it reduces the model's memory footprint and I/O requirements.
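As one hedged illustration of how a data type can be chosen in practice, the sketch below passes auto-cast options through to the Neuron compiler; the flag names follow the neuronx-cc command line and the compiler_args parameter of torch_neuronx.trace, and should be checked against the Neuron documentation for your SDK version:

```python
import torch
import torch.nn as nn
import torch_neuronx

model = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 8)).eval()
example_input = torch.rand(1, 64)

# Ask the Neuron compiler to auto-cast FP32 operations to BF16.
# Flag names follow the neuronx-cc CLI; adjust to your SDK version.
compiler_args = ["--auto-cast", "all", "--auto-cast-type", "bf16"]

neuron_model = torch_neuronx.trace(
    model,
    example_input,
    compiler_args=compiler_args,
)

print(neuron_model(example_input))
```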

Amazon Cloud Technology Inferentia2 embeds a general-purpose digital signal processor (DSP) that enables dynamic execution, so there is no need to unroll or execute control-flow operators on the host. Inferentia2 also supports dynamic input shapes, which is critical for models whose input tensor sizes are not known in advance, such as models that process text.

Amazon Cloud Technology Inferentia2 supports custom operators written in C++. Neuron Custom C++ Operators let users write custom operators that run natively on NeuronCores. Using the standard PyTorch custom-operator programming interface, users can migrate CPU custom operators to Neuron and implement new experimental operators, all without requiring deep knowledge of the NeuronCore hardware. The Python-side build-and-load step can look roughly like the sketch after this paragraph.
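The sketch below follows the general pattern shown in the Neuron custom-operator material, but the module path, helper name, and operator namespace are assumptions that should be verified against the current Neuron SDK documentation; the C++ source files referenced here are placeholders:

```python
import torch
import torch_neuronx  # registers the Neuron device with PyTorch
from torch_neuronx.xla_impl import custom_op  # assumed module path; verify in the Neuron docs

# Build the C++ kernel ("relu.cpp") and its shape function ("shape.cpp")
# into an operator library that runs natively on NeuronCores.
custom_op.load(
    name="relu",               # build name for the operator library
    compute_srcs=["relu.cpp"],  # placeholder source files
    shape_srcs=["shape.cpp"],
)

# Once loaded, the operator is called through the standard torch.ops
# namespace; "my_ops" and "relu_forward" would be whatever names the
# C++ sources register via TORCH_LIBRARY (hypothetical here).
x = torch.rand(8, 8)
y = torch.ops.my_ops.relu_forward(x)
```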

Inf2 instances are the first inference-optimized instances on Amazon EC2 to support distributed inference through a direct, ultra-high-speed connection between chips, NeuronLink v2. NeuronLink v2 allows high-performance inference pipelines to run across all chips using collective-communication operators such as all-reduce.
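To make this concrete, the sketch below shows one common way to exploit those chip-to-chip collectives: sharding a language model across NeuronCores with the transformers-neuronx library. The model name and tp_degree value are illustrative, the module path varies between library versions, and some versions require the checkpoint to be pre-split with the library's own helpers first, so treat the details as assumptions to verify:

```python
import torch
from transformers import AutoTokenizer
from transformers_neuronx.gpt2.model import GPT2ForSampling  # module path may vary by version

# Shard the model across 2 NeuronCores (tensor parallelism). During
# execution, all-reduce collectives over NeuronLink v2 combine the
# partial results computed on each core.
model = GPT2ForSampling.from_pretrained("gpt2", tp_degree=2, amp="f16")
model.to_neuron()  # compiles and loads the sharded model onto the NeuronCores

tokenizer = AutoTokenizer.from_pretrained("gpt2")
input_ids = tokenizer("Inferentia2 makes it possible to", return_tensors="pt").input_ids

with torch.inference_mode():
    generated = model.sample(input_ids, sequence_length=64)

print(tokenizer.decode(generated[0]))
```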

Neuron SDK

Amazon Cloud Technology Neuron is the SDK that optimizes the performance of complex neural network models executed on Amazon Cloud Technology Inferentia and Trainium. It includes a deep learning compiler, runtime, and tools that are natively integrated with popular frameworks such as TensorFlow and PyTorch, and it comes pre-installed in Amazon Cloud Technology Deep Learning AMIs and Deep Learning Containers, so customers can quickly start running high-performance, cost-effective inference.

The Neuron compiler accepts machine learning models in multiple formats (TensorFlow, PyTorch, XLA HLO) and optimizes them to run on Neuron devices. The compiler is invoked from within the machine learning framework: the Neuron framework plugin sends the model to the compiler, and the resulting artifact, a NEFF file (Neuron Executable File Format), is in turn loaded onto the Neuron device by the Neuron runtime.
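A small sketch of that flow from the framework side is shown below. It uses the compiler_workdir argument of torch_neuronx.trace to keep the intermediate compiler artifacts, including the NEFF, on disk for inspection; treat the exact argument name and output layout as assumptions to check against the torch-neuronx API reference:

```python
import os
import torch
import torch.nn as nn
import torch_neuronx

model = nn.Linear(32, 4).eval()
example_input = torch.rand(1, 32)

# The Neuron framework plugin hands the traced graph to neuronx-cc; the
# compiler writes its artifacts (including the NEFF) into this directory.
workdir = "./neuron_compile_artifacts"
neuron_model = torch_neuronx.trace(model, example_input, compiler_workdir=workdir)

# List what the compiler produced; the .neff file is what the Neuron
# runtime loads onto the device at execution time.
for root, _, files in os.walk(workdir):
    for f in files:
        if f.endswith(".neff"):
            print(os.path.join(root, f))
```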

The Neuron runtime consists of a kernel driver and C/C++ libraries that provide APIs to access Inferentia and Trainium Neuron devices. The Neuron framework plugins for TensorFlow and PyTorch use the Neuron runtime to load and run models on NeuronCores. The runtime loads compiled deep learning models (NEFF files) onto Neuron devices and is optimized for high throughput and low latency.
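As a brief, hedged illustration of interacting with the runtime from application code, the sketch below restricts a process to specific NeuronCores before loading a previously compiled artifact. The NEURON_RT_VISIBLE_CORES variable comes from the Neuron runtime configuration documentation; the model file name is the illustrative artifact from the earlier tracing example:

```python
import os
import torch

# Restrict this process to NeuronCores 0 and 1 before any Neuron library
# initializes the runtime (variable name from the Neuron runtime
# configuration guide; verify the accepted format for your SDK version).
os.environ["NEURON_RT_VISIBLE_CORES"] = "0-1"

import torch_neuronx  # noqa: E402  (imported after setting the env var)

# Loading a previously compiled TorchScript artifact causes the Neuron
# runtime to load the embedded NEFF onto the visible NeuronCores.
neuron_model = torch.jit.load("smallnet_neuron.pt")
example_input = torch.rand(1, 128)
print(neuron_model(example_input).shape)
```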


Reprinted from: blog.csdn.net/m0_66395609/article/details/130682691