Generative AI New World | An analysis of deployment methods for the Falcon 40B open source large model


The author of this article is Huang Haowen

Amazon Cloud Technology Senior Developer Evangelist

In the previous article, we discussed how to fine-tune a model on a custom dataset. In this article, we return to the text generation deployment scenario and discuss how to deploy Falcon 40B, an open source large model with 40 billion parameters, on Amazon SageMaker.

We will compare two different deployment methods:

  1. Out-of-the-box Amazon SageMaker JumpStart deployment;

  2. Amazon SageMaker Notebook deployment, which offers more granular control.

Falcon 40B open source large model overview

Falcon 40B is a large language model developed by the Technology Innovation Institute (TII) in the United Arab Emirates. Released in 2023, it is one of the largest open source large language models currently available, with 40 billion parameters. The Falcon 40B model was trained on a variety of text and code datasets, including the RefinedWeb dataset, a filtered version of the Common Crawl dataset.

1. Features of Falcon 40B

The Falcon 40B has several features that make it a powerful large-scale language model, including:

  • Large size: Falcon 40B has 40 billion parameters, which enables it to learn more complex relationships between words and concepts

  • Efficient training: Falcon 40B uses several technologies to make its training more efficient, such as 3D parallelism and ZeRO optimization

  • Advanced architecture: Falcon 40B uses advanced architectural techniques, including FlashAttention and multi-query attention. These techniques enable Falcon 40B to better understand long-distance dependencies in text

  • Open source: Falcon 40B is open source, allowing researchers and developers to experiment with and improve it

2. Training data of Falcon 40B

Falcon-40B was trained on 1 trillion tokens from RefinedWeb, a filtered and deduplicated, high-quality web dataset. It is worth mentioning that the Falcon team considers the data quality of the RefinedWeb dataset to be very high; for this reason, they also published a dedicated paper about it:


Source: https://arxiv.org/pdf/2306.01116.pdf, 2023/06

3. Training parameters and process of Falcon 40B

Falcon-40B was trained on Amazon SageMaker, using 384 A100 40GB GPUs on p4d instances. During training, Falcon-40B used a 3D parallelism strategy (TP=8, PP=4, DP=12, i.e. 8 × 4 × 12 = 384 GPUs) together with ZeRO optimization. Model training started in December 2022 and lasted two months. Its main training parameters are as follows:


Source: https://huggingface.co/tiiuae/falcon-40b

4. Model structure of Falcon 40B

Falcon-40B is a causal decoder-only model trained on the causal language modeling task of predicting the next token. The architecture is broadly adapted from the GPT-3 paper (Brown et al., 2020), with the following major differences:

  1. Positional embeddings: rotary positional embeddings (Su et al., 2021)

  2. Attention mechanism: multi-query attention (Shazeer et al., 2019) and FlashAttention (Dao et al., 2022)

  3. Decoder block: parallel attention/MLP with two layer norms

Its published hyperparameter configuration is as follows:


Source: https://huggingface.co/tiiuae/falcon-40b

5. Performance of Falcon 40B

Falcon 40B has been shown to outperform other LLMs on several benchmarks, including GLUE, SQuAD, and RACE. It has also proven effective for a variety of tasks such as text generation, machine translation, and question answering.

The main parameters of the Falcon 40B model are as follows:

  • Parameters: 40 billion

  • Training data: 1 trillion tokens

  • Architecture: Transformer

  • Optimizer: Adam

  • Loss Function: Cross Entropy

  • Evaluation metrics: BLEU, ROUGE, F1

Deployment method one: Using Amazon SageMaker JumpStart to deploy

This section will introduce how to use the SageMaker Python SDK and Amazon SageMaker JumpStart to deploy the Falcon 40B open source large model for text generation. This example includes:

  1. Set up development environment

  2. Get the Hugging Face model ID and version of the Falcon 40B open source large model

  3. Use the JumpStartModel function to deploy the Falcon 40B large model

  4. Run inference and talk to the model (including code generation, question answering, translation, and more)

  5. Clean up the environment

1. Start the Amazon SageMaker JumpStart environment

1. Enter "Amazon SageMaker" in the Amazon Cloud Technology Console .


2. Click "Studio" and then "Open Studio" .


3. Click "Launch -> Studio" .


4. Wait for Amazon SageMaker Studio to start.


5. After clicking "SageMaker JumpStart -> Models, notebooks, solutions", select "Text Models -> Falcon 40B Instruct BF16".



7. After "Starting notebook kernel..." is started, you can execute the sample code for deploying the Falcon 40B open source large model!


The complete code for this experiment can be obtained from the Amazon SageMaker examples repository.

The GitHub address of the complete code for this experiment is as follows:

https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_amazon_algorithms/jumpstart-foundation-models/text-generation-falcon.ipynb

Interested developers can refer to the above example and execute the notebook cell by cell to complete the experiment. Since the notebook is written clearly and concisely, we will not go through the code in detail here; interested readers can follow the previous steps to set up an execution environment and experience it for themselves. For quick reference, a condensed sketch of the JumpStart deployment and a single inference call is shown below.
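
The following is a minimal sketch of the JumpStart deployment path, written against the SageMaker Python SDK. It assumes the Studio execution role is used and that the model ID matches the "Falcon 40B Instruct BF16" JumpStart card selected above; the payload fields are assumptions based on the TGI-style interface discussed later in this article, and the official notebook remains the authoritative version:

# Minimal JumpStart deployment sketch (assumed model ID; see the official notebook for details)
from sagemaker.jumpstart.model import JumpStartModel

model_id = "huggingface-llm-falcon-40b-instruct-bf16"
my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy()  # provisions a real-time endpoint; this can take several minutes

# Hypothetical inference call using the {"inputs": ..., "parameters": ...} payload convention
payload = {
    "inputs": "Can you tell me something about Amazon SageMaker?",
    "parameters": {"max_new_tokens": 256, "temperature": 0.8, "do_sample": True},
}
response = predictor.predict(payload)
print(response)

# Delete the endpoint when finished to avoid unnecessary charges
predictor.delete_model()
predictor.delete_endpoint()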

Deployment method two: Using Amazon SageMaker Notebook to deploy

This section will introduce an example of how to use the new Hugging Face LLM inference container to deploy an open source large language model, such as Falcon 40B, to Amazon SageMaker for inference. This example includes:

  1. Set up development environment

  2. Get the new Hugging Face LLM DLC

  3. Deploy Falcon 40B to Amazon SageMaker

  4. Run inference and talk to the model

  5. Clean up the environment

1. Set up the development environment

We will use the Amazon SageMaker Python SDK to deploy Falcon 40B to an endpoint for model inference. We first need to ensure that the SageMaker Python SDK is installed correctly, as shown in the following code:

# install supported sagemaker SDK
!pip install "sagemaker>=2.175.0" --upgrade --quiet


import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket=None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()


try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']


sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)


print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")


For more detailed configuration instructions on the IAM roles required for Amazon SageMaker, you can refer to this document:

https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html

2. Get Hugging Face LLM DLC

The Hugging Face LLM DLC is a new purpose-built inference container that makes it easy to deploy LLMs in a secure, managed environment. The DLC is powered by Text Generation Inference (TGI), an open source, purpose-built solution for deploying and serving large language models (LLMs). TGI uses tensor parallelism and dynamic batching to enable high-performance text generation for the most popular open source LLMs. With the new Hugging Face LLM Inference DLC launched on Amazon SageMaker, customers can get an LLM experience that supports high concurrency and low latency.

  • Text Generation Inference (TGI)

    https://github.com/huggingface/text-generation-inference

Compared to deploying a regular Hugging Face model, we first need to retrieve the container URI and provide it to the HuggingFaceModel class through its image_uri parameter. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI of the desired Hugging Face LLM DLC based on the specified backend, session, region, and version.

All available HuggingFace LLM DLC versions can be found at:

https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers

from sagemaker.huggingface import get_huggingface_llm_image_uri


# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.9.3"
)


# print ecr image uri
print(f"llm image uri: {llm_image}")


3. Deploy Falcon 40B to an Amazon SageMaker endpoint

To deploy Falcon 40B Instruct to Amazon SageMaker, we need to create a HuggingFaceModel instance and define the related endpoint configuration, including hf_model_id, instance_type, etc. For this demonstration, we will use the ml.g5.12xlarge instance type, which has 4 NVIDIA A10G GPUs and 96 GB of total GPU memory.

Additionally, Amazon SageMaker quotas may vary by account. If this deployment exceeds your quota, you can request an increase through the Service Quotas console (a programmatic quota check is also sketched below):

https://console.aws.amazon.com/servicequotas/home/services/sagemaker/quotas
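
If you prefer to check the relevant quota programmatically before deploying, the Service Quotas API can also be queried with boto3. The sketch below is illustrative only; the quota name filter is an assumption and the exact quota names may differ by account and region:

# Illustrative quota check with the boto3 Service Quotas client (assumed quota name filter)
import boto3

sq = boto3.client("service-quotas")
paginator = sq.get_paginator("list_service_quotas")

for page in paginator.paginate(ServiceCode="sagemaker"):
    for quota in page["Quotas"]:
        if "ml.g5.12xlarge" in quota["QuotaName"] and "endpoint" in quota["QuotaName"].lower():
            print(quota["QuotaName"], "->", quota["Value"])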

The deployment code looks like this:

import json
from sagemaker.huggingface import HuggingFaceModel


# sagemaker config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300


# TGI config
config = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", # model_id from hf.co/models
  'SM_NUM_GPUS': json.dumps(number_of_gpu), # Number of GPU used per replica
  'MAX_INPUT_LENGTH': json.dumps(1024),  # Max length of input text
  'MAX_TOTAL_TOKENS': json.dumps(2048),  # Max length of the generation (including input text)
  # 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize
}


# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)


Careful readers will notice a commented-out line in the example code above:

# 'HF_MODEL_QUANTIZE': "bitsandbytes", # comment in to quantize


Quantization is another interesting and broad area of knowledge, which we will elaborate on in the upcoming article on fine-tuning the Falcon 40B large model. As a small preview, enabling it in this deployment path only requires activating that environment variable, as sketched below.
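
The following snippet is purely illustrative: it reuses the json import and the number_of_gpu variable from the config block above and simply turns on the HF_MODEL_QUANTIZE setting. Whether bitsandbytes quantization is appropriate depends on your accuracy and latency requirements:

# Sketch: the same TGI config as above, but with bitsandbytes quantization enabled.
# Quantization reduces GPU memory usage at some cost in generation quality and speed.
config_quantized = {
  'HF_MODEL_ID': "tiiuae/falcon-40b-instruct",
  'SM_NUM_GPUS': json.dumps(number_of_gpu),
  'MAX_INPUT_LENGTH': json.dumps(1024),
  'MAX_TOTAL_TOKENS': json.dumps(2048),
  'HF_MODEL_QUANTIZE': "bitsandbytes",  # enable on-the-fly quantization in TGI
}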

After creating the HuggingFaceModel, we can use the deploy method to deploy it to an Amazon SageMaker endpoint. We will use the ml.g5.12xlarge instance type to deploy the model. Text Generation Inference (TGI) will automatically distribute and shard the model across all GPUs, as shown in the following code:

  • Text Generation Inference (TGI)

    https://github.com/huggingface/text-generation-inference

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  # volume_size=400, # If using an instance with local SSD storage, volume_size must be None, e.g. p4 but not p3
  container_startup_health_check_timeout=health_check_timeout, # 10 minutes to be able to load the model
)


4. Make inferences and talk to the model

After deploying the endpoint, we can use the predict method to start model inference.

We can control generation using different parameters, which can be defined in the parameters attribute of the payload. The Hugging Face LLM DLC inference container supports various generation parameters, including top_p, temperature, stop, max_new_tokens, and more.

You can find the complete list of supported parameters in the following documentation:

https://huggingface.co/blog/sagemaker-huggingface-llm#4-run-inference-and-chat-with-our-model

As of today, TGI supports the following parameters:

  • temperature: Controls randomness in the model. Lower values make the model more deterministic, while higher values make it more stochastic. The default value is 0.

  • max_new_tokens: The maximum number of tokens to generate. The default value is 20, and the maximum value is 512.

  • repetition_penalty: Controls the likelihood of repetition. Defaults to null.

  • seed: The seed to use for random generation. Defaults to null.

  • stop: A list of tokens at which to stop generation. Generation stops once one of these tokens is produced.

  • top_k: The number of highest-probability vocabulary tokens to keep for top-k filtering. The default value is null, which disables top-k filtering.

  • top_p: The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus (top-p) sampling. Defaults to null.

  • do_sample: Whether to use sampling; otherwise greedy decoding is used. The default value is false.

  • best_of: Generate best_of sequences and return the one with the highest token log-probabilities. Defaults to null.

  • details: Whether to return details about the generation. The default value is false.

  • return_full_text: Whether to return the full text or only the generated part. The default value is false.

  • truncate: Whether to truncate the input to the model's maximum length. The default value is true.

  • typical_p: The typical probability mass of a token. Defaults to null.

  • watermark: Whether to apply watermarking during generation. The default value is false.

Because the tiiuae/falcon-40b-instruct open source large model we deployed is a dialogue (chat) model, we can use the following prompt to chat with it:

# define payload
prompt = """You are an helpful Assistant, called Falcon. Knowing everyting about AWS.


User: Can you tell me something about Amazon SageMaker?
Falcon:"""


# hyperparameters for llm
payload = {
  "inputs": prompt,
  "parameters": {
    "do_sample": True,
    "top_p": 0.9,
    "temperature": 0.8,
    "max_new_tokens": 1024,
    "repetition_penalty": 1.03,
    "stop": ["\nUser:","<|endoftext|>","</s>"]
  }
}


# send request to endpoint
response = llm.predict(payload)


# extract and print the assistant's response
assistant = response[0]["generated_text"][len(prompt):]
print(assistant)


The LLM generates a paragraph describing Amazon SageMaker; for the convenience of readers, its output is reproduced below:

'Amazon SageMaker is a fully managed platform that enables developers and data scientists to quickly build, train, and deploy machine learning models in the cloud. It provides a wide range of tools and services, including Jupyter notebooks, algorithms, pre-trained models, and easy-to-use APIs, so you can quickly get started building machine learning applications.'

We can continue to ask the Falcon 40B large model questions, for example:

new_prompt = f"""{prompt}{assistant}
User: How would you recommend start using Amazon SageMaker? If i am new to Machine Learning?
Falcon:"""
# update payload
payload["inputs"] = new_prompt


# send request to endpoint
response = llm.predict(payload)


# print assistant response
new_assistant = response[0]["generated_text"][len(new_prompt):]
print(new_assistant)


The answer given by the Falcon 40B large model is reproduced below for your reference:

'If you're new to machine learning, you can start with pre-built algorithms and pre-trained models available in Amazon SageMaker. You can also use Jupyter notebooks to create and run your own experiments. Additionally, you can take advantage of the AutoPilot feature to automatically build and train machine learning models based on your data. The best way to get started is to experiment and try different things to see what works best for your specific use case.'
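
Incidentally, the endpoint created by llm_model.deploy() is a standard SageMaker real-time endpoint, so it can also be invoked outside this notebook with the low-level boto3 runtime client. The following is a minimal sketch, assuming the endpoint name is read from the llm predictor created above and that the payload format matches the TGI convention used earlier:

# Sketch: invoking the same endpoint with the low-level SageMaker runtime client
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "User: What is Amazon SageMaker JumpStart?\nFalcon:",
    "parameters": {"max_new_tokens": 128, "temperature": 0.8, "do_sample": True},
}

response = runtime.invoke_endpoint(
    EndpointName=llm.endpoint_name,   # name of the endpoint deployed above
    ContentType="application/json",
    Body=json.dumps(payload),
)
result = json.loads(response["Body"].read())
print(result[0]["generated_text"])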

5. Delete resources and clean up the environment

We have deployed the Falcon 40B open source large model to the Amazon SageMaker endpoint and successfully performed model inference. After completing this experiment, please remember to delete resources and clean up the environment, including deleting models and endpoints, to avoid unnecessary charges.

Sample code to delete resources and clean up the environment is as follows:

llm.delete_model()
llm.delete_endpoint()

6. Reference documents

The deployment method in this section mainly refers to the following English document; during the explanation, the author added some detailed descriptions and made textual adjustments:

https://www.philschmid.de/sagemaker-falcon-llm

Comparison and summary

This article deployed the Falcon 40B open source large language model in two different ways.

First, we used Amazon SageMaker JumpStart to deploy the model. The core code is as follows:

model_id, model_version = "huggingface-llm-falcon-40b-instruct-bf16", "*"


from sagemaker.jumpstart.model import JumpStartModel
my_model = JumpStartModel(model_id=model_id)
predictor = my_model.deploy()


Second, we used an Amazon SageMaker Notebook to deploy the model. Its core code is as follows:

# Retrieve the new Hugging Face LLM DLC
from sagemaker.huggingface import get_huggingface_llm_image_uri


# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface",
  version="0.8.2"
)


# print ecr image uri
printf("llm image uri: {llm_image}")
# Deploy Falcon 40B Model
from sagemaker.huggingface import HuggingFaceModel


# instance config
instance_type = "ml.g5.12xlarge"
number_of_gpu = 4
health_check_timeout = 300


# TGI config
config = {
      'HF_MODEL_ID': "tiiuae/falcon-40b-instruct", 
      # ...
}


# create HuggingFaceModel
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)
llm = llm_model.deploy(
      # ...
)


From the above comparison of the core code, we can see that if you are a beginner who wants an out-of-the-box experience, you can choose Amazon SageMaker JumpStart, a fast and concise deployment method. If you already have a certain understanding of the Amazon SageMaker service and want finer-grained control over the large model deployment process (for example: deployment instance type, image version number, TGI parameters, etc.), you can choose the Amazon SageMaker Notebook deployment method, which gives more comprehensive control over configuration parameters.


In the next article, we'll explore how to use Amazon SageMaker Notebook to quickly and efficiently fine-tune large language models in an interactive environment. We will use QLoRA and 4-bit bitsandbytes quantization techniques to fine-tune the Falcon-40B model with Hugging Face PEFT on Amazon SageMaker. This is a cutting-edge topic in the current open source large model field, so please stay tuned.

Please continue to follow the "Amazon Cloud Developer" WeChat official account to learn more about technology sharing and cloud development trends for developers!

The author of this article


Huang Haowen

Senior Developer Evangelist at Amazon Cloud Technology, focusing on AI/ML, data science, and related fields. He has more than 20 years of experience in architecture design, technology, and entrepreneurial management in industries such as telecommunications, mobile Internet, and cloud computing. He has worked at Microsoft, Sun Microsystems, China Telecom, and other companies, providing consulting services on solutions such as AI/ML, data analytics, and enterprise digital transformation to corporate customers in gaming, e-commerce, media, advertising, and other sectors.


