Running the GPT-J-6B model on Amazon Cloud Technology Inf2 instances

At Amazon Cloud Technology re:Invent 2019, Amazon Cloud Technology released two pieces of inference infrastructure: the Inferentia chip and the Inf1 instance. Inferentia is a high-performance machine learning inference chip, custom-designed by Amazon Cloud Technology to deliver cost-effective, low-latency predictions at scale. Four years later, in April 2023, Amazon Cloud Technology released the Inferentia2 chip and the Inf2 instance, aimed at providing technical support for large-model inference.


Application Scenarios of Inf2 Instances

Amazon Cloud Technology Inf2 instances can run popular applications such as text summarization, code generation, video and image generation, speech recognition, and personalization. Inf2 instances are the first inference-optimized instances in Amazon EC2 to introduce scale-out distributed inference powered by NeuronLink, a high-speed, non-blocking interconnect, so models with hundreds of billions of parameters can now be deployed efficiently across multiple accelerators on a single Inf2 instance. Inf2 instances deliver three times higher throughput, eight times lower latency, and 40 percent better price/performance than other comparable Amazon EC2 instances. To help meet sustainability goals, they also deliver 50 percent higher performance per watt than comparable Amazon EC2 instances.

 

Run the GPT-J-6B model with an Inf2 instance

GPT-J-6B is an open-source autoregressive language model created by a group of researchers called EleutherAI. It is one of the most advanced alternatives to OpenAI's GPT-3 and performs well on a wide range of natural language tasks such as chat, summarization, and question answering.

The model consists of 28 layers with a model dimension of 4,096 and a feed-forward dimension of 16,384. The model dimension is divided into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50,257, using the same BPE tokenizer as GPT-2/GPT-3. The key hyperparameters are listed in the table below, and a short sketch after the table shows how to read them back from the published model configuration.

Hyperparameter      Value
n_parameters        6,053,381,344
n_layers            28
d_model             4,096
d_ff                16,384
n_heads             16
d_head              256
n_ctx               2,048
n_vocab             50,257 (same tokenizer as GPT-2/3)
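
The values in the table can be cross-checked against the published model configuration. Below is a minimal sketch that does this with the Hugging Face transformers package (assuming it is installed and the EleutherAI/gpt-j-6b model card is reachable) and roughly re-derives the roughly 6 billion parameter count from the listed dimensions.

```python
# Minimal sketch: read GPT-J-6B hyperparameters from its published config and
# roughly re-derive the parameter count. Assumes the transformers package is
# installed and the EleutherAI/gpt-j-6b model card is reachable.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("EleutherAI/gpt-j-6b")
print(config.n_layer, config.n_embd, config.n_head, config.rotary_dim)  # 28 4096 16 64

d_model = config.n_embd                       # 4,096
d_ff = 4 * d_model                            # 16,384 (n_inner defaults to 4 * n_embd)
per_layer = 4 * d_model**2 + 2 * d_model * d_ff   # attention (Q, K, V, O) + MLP weights
# config.vocab_size is 50,400: the 50,257 BPE tokens padded for efficiency.
embeddings = 2 * config.vocab_size * d_model      # input embedding + untied LM head
total = config.n_layer * per_layer + embeddings
print(f"~{total / 1e9:.2f}B parameters")          # about 6.05B, matching the table
```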

The GPT-J-6B architecture has 6 billion parameters, making it well suited as an introductory model for learning about Large Language Models (LLMs) and for text generation testing. Deployment uses the AWS Neuron SDK together with transformers-neuronx, an open-source library built by the AWS Neuron team to help run transformer-decoder inference workflows on the AWS Neuron SDK. It currently provides demo scripts for the GPT2, GPT-J, and OPT model types, whose forward functions are reimplemented for compile-time code analysis and optimization; other model architectures can be implemented on top of the same library. The AWS Neuron-optimized transformer-decoder classes are reimplemented in XLA HLO (High Level Operations) using a syntax called PyHLO. The library also implements tensor parallelism to shard the model weights across multiple NeuronCores, as sketched below.
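As a rough illustration of that workflow, the sketch below follows the pattern of the transformers-neuronx GPT-J demo scripts. The GPTJForSampling class, the tp_degree tensor-parallelism argument, and the sample() method are taken from those examples and may differ between Neuron SDK releases; the official demo also first saves a split checkpoint with save_pretrained_split before loading it, which is skipped here for brevity.

```python
# Minimal sketch of tensor-parallel GPT-J-6B inference with transformers-neuronx
# on an Inf2 instance. Assumes the Neuron SDK and transformers-neuronx are
# installed; the APIs shown follow the library's GPT-J demo and may vary by release.
from transformers import AutoTokenizer
from transformers_neuronx.gptj.model import GPTJForSampling

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")

# Load the checkpoint and shard its weights across NeuronCores (tensor parallelism).
# The official demo writes a split checkpoint with save_pretrained_split and points
# from_pretrained at that directory; shown loading directly here for brevity.
model = GPTJForSampling.from_pretrained(
    "EleutherAI/gpt-j-6b",
    tp_degree=8,      # number of NeuronCores the weights are sharded over
    amp="f16",        # run the compiled graph in FP16
    batch_size=1,
)
model.to_neuron()     # compile the reimplemented forward function for Neuron

input_ids = tokenizer("Amazon EC2 Inf2 instances are", return_tensors="pt").input_ids
generated = model.sample(input_ids, sequence_length=128)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```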

Origin blog.csdn.net/m0_66395609/article/details/130722901