Deploy Llama 2 7B/13B/70B on Amazon SageMaker
Llama 2 is the next version of LLaMA. It is trained on more data (2T tokens) and supports a context length of up to 4K tokens. Meta fine-tuned the conversational models with Reinforcement Learning from Human Feedback on over 1 million human annotations.
In this blog you will learn how to deploy a Llama 2 model to Amazon SageMaker. You are going to use the Hugging Face LLM DLC, a new purpose-built Inference Container to easily deploy LLMs in a secure and managed environment. The DLC is powered by Text Generation Inference (TGI), a scalable, optimized solution for deploying and serving Large Language Models (LLMs). The blog post also includes hardware requirements for the different model sizes.
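In this blog, we will cover how to:
1. Set up the development environment
2. Retrieve the new Hugging Face LLM DLC
3. Check the hardware requirements
4. Deploy Llama 2 to Amazon SageMaker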
Let's get started!
1. Setup development environment
You are going to use the sagemaker Python SDK to deploy Llama 2 to Amazon SageMaker. Make sure you have an AWS account configured and the sagemaker Python SDK installed.
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find out more about it here.
import sagemaker
import boto3
sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()
try:
    # works inside SageMaker (e.g. Studio or notebook instances)
    role = sagemaker.get_execution_role()
except ValueError:
    # local environment: look up the execution role by name via IAM
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']
sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)
print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
2. Retrieve the new Hugging Face LLM DLC
Compared to deploying regular Hugging Face models, you first need to retrieve the container URI and provide it to the HuggingFaceModel model class with an image_uri pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, you can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method retrieves the URI for the desired Hugging Face LLM DLC based on the specified backend, session, region, and version. You can find the available versions here.
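As a rough sketch of this step, the snippet below wraps get_huggingface_llm_image_uri in a small helper. The helper name retrieve_llm_image_uri, the version string 0.9.3, and the region us-east-1 are illustrative assumptions rather than values taken from this post; check the list of available versions before pinning one. The SDK import sits inside the function so the snippet can be loaded even where the sagemaker package is not installed.

```python
def retrieve_llm_image_uri(version, region=None):
    """Return the ECR image URI of the Hugging Face LLM DLC (hypothetical helper)."""
    # Imported lazily so this module loads even where the sagemaker SDK is absent.
    from sagemaker.huggingface import get_huggingface_llm_image_uri

    # "huggingface" is the backend name for the TGI-powered container.
    # If region is None, the SDK falls back to the region of the default session.
    return get_huggingface_llm_image_uri("huggingface", region=region, version=version)


# Example call (assumed version and region; requires the sagemaker SDK):
# llm_image = retrieve_llm_image_uri(version="0.9.3", region="us-east-1")
# print(f"llm image uri: {llm_image}")
```

The returned string is what gets passed as image_uri when creating the model.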
3. Hardware requirements
Llama 2 comes in 3 different sizes: 7B, 13B and 70B parameters. The hardware requirements vary based on the model size deployed to SageMaker. Below is a set of minimum requirements for each model size we tested.
Note: We haven't tested GPTQ models yet.
Model     | Instance Type      | Quantization | # of GPUs per replica
Llama 7B  | (ml.)g5.2xlarge    | -            | 1
Llama 13B | (ml.)g5.12xlarge   | -            | 4
Llama 70B | (ml.)g5.48xlarge   | bitsandbytes | 8
Llama 70B | (ml.)p4d.24xlarge  | -            | 8
Note: Amazon SageMaker currently doesn't support instance slicing. This means that, e.g. for Llama 70B, you cannot run multiple replicas on a single instance.
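For scripting, the table above can be transcribed into a small lookup. This is only a convenience sketch: the names MIN_SETUPS and setups_for are hypothetical, and the entries simply mirror the tested configurations listed in the table, with the ml. prefix written out as SageMaker expects.

```python
# Minimum validated setups, transcribed from the table above.
# Keys are (model size, instance type); values are (quantization, GPUs per replica).
MIN_SETUPS = {
    ("7B", "ml.g5.2xlarge"): (None, 1),
    ("13B", "ml.g5.12xlarge"): (None, 4),
    ("70B", "ml.g5.48xlarge"): ("bitsandbytes", 8),
    ("70B", "ml.p4d.24xlarge"): (None, 8),
}


def setups_for(model_size):
    """List the validated setups for a model size such as "70B"."""
    return [
        {"instance_type": instance, "quantization": quant, "num_gpus": gpus}
        for (size, instance), (quant, gpus) in MIN_SETUPS.items()
        if size == model_size
    ]


# Print every validated option for the 70B model.
for setup in setups_for("70B"):
    print(setup)
```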
These are the minimum setups we have validated for the 7B, 13B and 70B Llama 2 models to work on SageMaker. In the coming weeks, we plan to run detailed benchmarks covering latency and throughput across different hardware configurations. We currently do not recommend deploying Llama 70B to g5.48xlarge instances, since long requests can time out due to the 60s request timeout limit for SageMaker. Use p4d instances for deploying Llama 70B.
It might be possible to run Llama 70B on g5.48xlarge instances without quantization by reducing the MAX_TOTAL_TOKENS and MAX_BATCH_TOTAL_TOKENS parameters. We haven't tested this yet.
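A back-of-the-envelope estimate helps to see why the 70B model is tight on g5.48xlarge. The sketch below assumes 2 bytes per parameter (fp16/bf16 weights) and 24 GB per A10G GPU, which is consistent with the g5.12xlarge figures quoted in this post (4 GPUs, 96GB). It counts the weights only and ignores the KV cache, activations, and framework overhead, so real requirements are higher. The function name is hypothetical.

```python
def weight_memory_gb(num_params_billion, bytes_per_param=2):
    """Approximate memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    # (billions of parameters * 1e9) * bytes per parameter / 1e9 bytes per GB
    return num_params_billion * bytes_per_param


A10G_MEMORY_GB = 24    # assumption: 24 GB per A10G GPU
G5_48XLARGE_GPUS = 8   # from the table: 8 GPUs per replica

total_gpu_memory = A10G_MEMORY_GB * G5_48XLARGE_GPUS

for size in (7, 13, 70):
    weights = weight_memory_gb(size)
    headroom = total_gpu_memory - weights
    print(f"Llama 2 {size}B: ~{weights} GB of fp16 weights, "
          f"~{headroom} GB left on g5.48xlarge for everything else")
```

Under these assumptions the unquantized 70B weights take most of the instance's GPU memory, which fits with the idea that lowering MAX_TOTAL_TOKENS and MAX_BATCH_TOTAL_TOKENS (and with them the memory used during generation) might make it fit. This is an estimate, not a tested result.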
4. Deploy Llama 2 to Amazon SageMaker
To deploy meta-llama/Llama-2-13b-chat-hf to Amazon SageMaker, you create a HuggingFaceModel model class and define the endpoint configuration, including the hf_model_id, instance_type, etc. You will use a g5.12xlarge instance type, which has 4 NVIDIA A10G GPUs and 96GB of GPU memory.
Note: Llama 2 is a gated model. The form to enable access to Llama 2 on Hugging Face can be submitted after you have been granted access by Meta. Please visit the Meta website and accept the license terms and acceptable use policy before submitting the form. Requests will be processed in 1-2 days.
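The deployment step could look roughly like the sketch below. The helper names build_tgi_env and deploy_llama are hypothetical, and the token limits (2048 / 4096 / 8192) and the 300 second health check timeout are assumed example values, not settings confirmed by this post. The environment variable names follow the container's configuration conventions; because Llama 2 is gated, a Hugging Face Hub token with access to the model is passed as well. The SDK import is kept inside deploy_llama so the configuration part runs without the sagemaker package.

```python
import json


def build_tgi_env(model_id, num_gpus, hub_token, max_input_length=2048,
                  max_total_tokens=4096, max_batch_total_tokens=8192, quantize=None):
    """Build the container environment; every value must be a string."""
    env = {
        "HF_MODEL_ID": model_id,                                   # model id on hf.co/models
        "SM_NUM_GPUS": json.dumps(num_gpus),                       # GPUs used per replica
        "MAX_INPUT_LENGTH": json.dumps(max_input_length),          # max input tokens (assumed)
        "MAX_TOTAL_TOKENS": json.dumps(max_total_tokens),          # input + generated tokens (assumed)
        "MAX_BATCH_TOTAL_TOKENS": json.dumps(max_batch_total_tokens),  # tokens per batch (assumed)
        "HUGGING_FACE_HUB_TOKEN": hub_token,                       # needed for the gated model
    }
    if quantize is not None:
        env["HF_MODEL_QUANTIZE"] = quantize  # e.g. "bitsandbytes"
    return env


def deploy_llama(role, image_uri, env, instance_type="ml.g5.12xlarge",
                 health_check_timeout=300):
    """Create the model and deploy it to an endpoint (requires the sagemaker SDK)."""
    from sagemaker.huggingface import HuggingFaceModel

    llm_model = HuggingFaceModel(role=role, image_uri=image_uri, env=env)
    return llm_model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        # give the container time to download and load the model weights
        container_startup_health_check_timeout=health_check_timeout,
    )


# Configuration for the 13B chat model on a 4-GPU g5.12xlarge instance.
env = build_tgi_env("meta-llama/Llama-2-13b-chat-hf", num_gpus=4,
                    hub_token="<REPLACE WITH YOUR TOKEN>")
print(json.dumps(env, indent=2))
```

With role and the image URI from the earlier steps, calling deploy_llama(role, llm_image, env) would start the endpoint; replace the token placeholder first.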