How to deploy large language models to Amazon SageMaker with the Hugging Face LLM DLC

This article walks through an example of using the new Hugging Face LLM Inference Container to deploy open-source LLMs, such as [BLOOM](https://huggingface.co/bigscience/bloom?trk=cndc-detail), to Amazon SageMaker for inference. We will deploy the 12B [Open Assistant Model](https://open-assistant.io/?trk=cndc-detail), an open-source chat LLM trained by the Open Assistant initiative.

The example covers:

1. Set up the development environment
2. Get the new Hugging Face LLM DLC
3. Deploy Open Assistant 12B to Amazon SageMaker
4. Run inference and chat with our model
5. Clean up the environment

### What is the Hugging Face LLM Inference DLC?

The Hugging Face LLM DLC is a new purpose-built inference container that makes it easy to deploy LLMs in a secure, managed environment. The DLC is powered by [Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference?trk=cndc-detail), an open-source, purpose-built solution for deploying and serving large language models (LLMs). TGI uses tensor parallelism and dynamic batching to deliver high-performance text generation for the most popular open-source LLMs, including StarCoder, BLOOM, GPT-NeoX, Llama, and T5. TGI is already used by customers such as IBM, Grammarly, and the Open-Assistant initiative, and it implements optimizations for all supported model architectures, including:

* Tensor parallelism and custom CUDA kernels
* Optimized transformers code for inference using [flash-attention](https://github.com/HazyResearch/flash-attention?trk=cndc-detail) on the most popular architectures
* Quantization with [bitsandbytes](https://github.com/TimDettmers/bitsandbytes?trk=cndc-detail)
* [Continuous batching of incoming requests](https://github.com/huggingface/text-generation-inference/tree/main/router?trk=cndc-detail) for increased total throughput
* Accelerated weight loading (start-up time) with [safetensors](https://github.com/huggingface/safetensors?trk=cndc-detail)
* Logits warpers (temperature scaling, top-k, repetition penalty, …)
* Watermarking with [A Watermark for Large Language Models](https://arxiv.org/abs/2301.10226?trk=cndc-detail)
* Stop sequences, log probabilities
* Token streaming using server-sent events (SSE)

**Officially supported model architectures are currently:**

* [BLOOM](https://huggingface.co/bigscience/bloom?trk=cndc-detail)/[BLOOMZ](https://huggingface.co/bigscience/bloomz?trk=cndc-detail)
* [MT0-XXL](https://huggingface.co/bigscience/mt0-xxl?trk=cndc-detail)
* [Galactica](https://huggingface.co/facebook/galactica-120b?trk=cndc-detail)
* [SantaCoder](https://huggingface.co/bigcode/santacoder?trk=cndc-detail)
* [GPT-NeoX 20B](https://huggingface.co/EleutherAI/gpt-neox-20b?trk=cndc-detail) (joi, pythia, lotus, rosey, chip, RedPajama, Open Assistant)
* [FLAN-T5-XXL](https://huggingface.co/google/flan-t5-xxl?trk=cndc-detail) (T5-11B)
* [Llama](https://github.com/facebookresearch/llama?trk=cndc-detail) (vicuna, alpaca, koala)
* [StarCoder](https://huggingface.co/bigcode/starcoder?trk=cndc-detail)/[SantaCoder](https://huggingface.co/bigcode/santacoder?trk=cndc-detail)
* [Falcon 7B](https://huggingface.co/tiiuae/falcon-7b?trk=cndc-detail)/[Falcon 40B](https://huggingface.co/tiiuae/falcon-40b?trk=cndc-detail)

With the new Hugging Face LLM Inference DLC on Amazon SageMaker, AWS customers can benefit from the same technology that powers highly concurrent, low-latency LLM experiences such as [HuggingChat](https://hf.co/chat?trk=cndc-detail), [OpenAssistant](https://open-assistant.io/?trk=cndc-detail), and the inference API for LLM models on the Hugging Face Hub. Let's get started!

### 1. Set up the development environment

We will use the SageMaker Python SDK to deploy `OpenAssistant/pythia-12b-sft-v8-7k-steps` to Amazon SageMaker. We need an AWS account and an installation of the SageMaker Python SDK.

```python
# TODO: once PR is merged: https://github.com/aws/sagemaker-python-sdk/pull/3837/files
!pip install git+https://github.com/xyang16/sagemaker-python-sdk.git@hf --upgrade
#!pip install sagemaker --upgrade --quiet
```

If you are going to use SageMaker in a local environment, you need access to an IAM role with the permissions SageMaker requires. You can find more information about it [here](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html?trk=cndc-detail).

```python
import sagemaker
import boto3

sess = sagemaker.Session()

# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    # works inside SageMaker notebooks and Studio
    role = sagemaker.get_execution_role()
except ValueError:
    # outside SageMaker, look the role up by name
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```

### 2. Get the new Hugging Face LLM DLC

Compared with deploying regular Hugging Face models, we first need to retrieve the container URI and pass it to our `HuggingFaceModel` model class, with `image_uri` pointing to the image. To retrieve the new Hugging Face LLM DLC in Amazon SageMaker, we can use the `get_huggingface_llm_image_uri` method provided by the SageMaker SDK. This method lets us retrieve the URI of the desired Hugging Face LLM DLC based on the specified `backend`, `session`, `region`, and `version`. You can find the available versions [here](https://github.com/aws/deep-learning-containers/blob/master/available_images.md#huggingface-text-generation-inference-containers?trk=cndc-detail).

```python
from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface",
    version="0.6.0"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")
```

### 3. Deploy Open Assistant 12B to Amazon SageMaker

To deploy the Open Assistant model (`OpenAssistant/pythia-12b-sft-v8-7k-steps`) to Amazon SageMaker, we create a `HuggingFaceModel` model class and define our endpoint configuration, including `hf_model_id`, `instance_type`, and so on. We will use the g5.4xlarge instance type, which has 1 NVIDIA A10G GPU with 24 GB of GPU memory and 64 GB of system memory.

```python
import json
from sagemaker.huggingface import HuggingFaceModel

# Define Model and Endpoint configuration parameter
hf_model_id = "OpenAssistant/pythia-12b-sft-v8-7k-steps"  # model id from huggingface.co/models
use_quantization = True  # whether to use quantization or not
instance_type = "ml.g5.4xlarge"  # instance type to use for deployment
number_of_gpu = 1  # number of gpus to use for inference and tensor parallelism
health_check_timeout = 300  # Increase the timeout for the health check to 5 minutes for downloading the model

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
    role=role,
    image_uri=llm_image,
    env={
        'HF_MODEL_ID': hf_model_id,
        'HF_MODEL_QUANTIZE': json.dumps(use_quantization),
        'SM_NUM_GPUS': json.dumps(number_of_gpu)
    }
)

# Deploy the model to an endpoint. This call is not shown in the source
# article; it is a sketch that uses the settings defined above.
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
)
```

SageMaker will now create our endpoint and deploy the model to it. This can take 10-15 minutes.

### 4. Run inference and chat with our model

After the endpoint is deployed, we can run inference on it. We will use the `predict` method of the predictor to run inference on our endpoint. We can pass different parameters to influence the generation. Parameters are set in the `parameters` attribute of the payload. As of today, TGI supports the following parameters:

* `temperature`: Controls randomness in the model. Lower values make the model more deterministic and higher values make it more random. Default value is 1.0.
* `max_new_tokens`: The maximum number of tokens to generate. Default value is 20, maximum value is 512.
* `repetition_penalty`: Controls the likelihood of repetition. Defaults to null.
* `seed`: The seed to use for random generation. Defaults to null.
* `stop`: A list of tokens that stop the generation. Generation stops as soon as one of these tokens is produced.
* `top_k`: The number of highest-probability vocabulary tokens to keep for top-k filtering. Default value is null, which disables top-k filtering.
* `top_p`: The cumulative probability of the highest-probability vocabulary tokens to keep for nucleus sampling. Defaults to null.
* `do_sample`: Whether to use sampling; greedy decoding is used otherwise. Default value is false.
* `best_of`: Generate `best_of` sequences and return the one with the highest token log probabilities. Defaults to null.
* `details`: Whether to return details about the generation. Default value is false.
* `return_full_text`: Whether to return the full text or only the generated part. Default value is false.
* `truncate`: Whether to truncate the input to the maximum length of the model. Default value is true.
* `typical_p`: The typical probability of a token. Defaults to null.
* `watermark`: Whether to watermark the generated text. Default value is false.

You can find the OpenAPI specification of TGI in the [swagger documentation](https://huggingface.github.io/text-generation-inference/?trk=cndc-detail).

`OpenAssistant/pythia-12b-sft-v8-7k-steps` is a conversational chat model, which means we can chat with it using the following prompt:

`<|prompter|>[Instruction]<|endoftext|><|assistant|>`

Let's give it a first try and ask for some cool ideas to do in the summer:

```python
from sagemaker.huggingface import HuggingFacePredictor

# attach to the deployed endpoint by name (replace with your own endpoint name)
llm = HuggingFacePredictor(endpoint_name="huggingface-pytorch-tgi-inference-2023-06-01-03-54-10-543")

chat = llm.predict({
    "inputs": """<|prompter|>What are some cool ideas to do in the summer?<|endoftext|><|assistant|>"""
})

print(chat[0]["generated_text"])
```

Now we will run inference with different parameters to influence the generation. Parameters are defined in the `parameters` attribute of the payload. This can be used to make the model stop generating after the turn of the "bot".

```python
# define payload
prompt = """<|prompter|>How can i stay more active during winter? Give me 3 tips.<|endoftext|><|assistant|>"""

# hyperparameters for llm
payload = {
    "inputs": prompt,
    "parameters": {
        "do_sample": True,
        "top_p": 0.7,
        "temperature": 0.7,
        "top_k": 50,
        "max_new_tokens": 256,
        "repetition_penalty": 1.03,
        "stop": ["<|endoftext|>"]
    }
}

# send request to endpoint
response = llm.predict(payload)

print(response[0]["generated_text"])
```

Now let's build a quick gradio application to chat with it.

```python
!pip install gradio --upgrade
```

```python
import gradio as gr

# hyperparameters for llm
parameters = {
    "do_sample": True,
    "top_p": 0.7,
    "temperature": 0.7,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "stop": ["<|endoftext|>"]
}

with gr.Blocks() as demo:
    gr.Markdown("## Chat with Amazon SageMaker")
    with gr.Column():
        chatbot = gr.Chatbot()
        with gr.Row():
            with gr.Column():
                message = gr.Textbox(label="Chat Message Box", placeholder="Chat Message Box", show_label=False)
            with gr.Column():
                with gr.Row():
                    submit = gr.Button("Submit")
                    clear = gr.Button("Clear")

    def respond(message, chat_history):
        # convert chat history to prompt
        converted_chat_history = ""
        if len(chat_history) > 0:
            for c in chat_history:
                converted_chat_history += f"<|prompter|>{c[0]}<|endoftext|><|assistant|>{c[1]}<|endoftext|>"
        prompt = f"{converted_chat_history}<|prompter|>{message}<|endoftext|><|assistant|>"

        # send request to endpoint
        llm_response = llm.predict({"inputs": prompt, "parameters": parameters})

        # remove prompt from response
        parsed_response = llm_response[0]["generated_text"][len(prompt):]
        chat_history.append((message, parsed_response))
        return "", chat_history

    submit.click(respond, [message, chatbot], [message, chatbot], queue=False)
    clear.click(lambda: None, None, chatbot, queue=False)

demo.launch(share=True)
```

```
Running on local URL: http://127.0.0.1:7861
Running on public URL: https://4a3a32ce84d8b318b9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades (NEW!), check out Spaces: https://huggingface.co/spaces
```

Once the program is running, it shows the following chat window:

![插图1.jpg](https://dev-media.amazoncloud.cn/9cc9ca0b17fc4af48167ae566fb8809e_%E6%8F%92%E5%9B%BE1.jpg "插图1.jpg")

Great! We have successfully deployed the Open Assistant model to Amazon SageMaker and run inference on it. We also built a quick gradio application to chat with our model. Now it is your turn to build generative AI applications with the new Hugging Face LLM DLC on Amazon SageMaker.

### 5. Clean up the environment

To clean up, we can delete the model and the endpoint.

```python
llm.delete_model()
llm.delete_endpoint()
```

If you want to test and run the example above on SageMaker, the complete notebook is available on GitHub.

### 6. Summary

As the steps above show, the whole deployment process is very simple, mainly thanks to the SageMaker Hugging Face LLM DLC. You can also integrate the endpoint deployed on SageMaker with your own applications to meet real business needs.

##### About the author

![刘恒涛.jpg](https://dev-media.amazoncloud.cn/e0cd745e1b8f4c73b7080e2edb6b53f0_%E5%88%98%E6%81%92%E6%B6%9B.jpg "刘恒涛.jpg")\
**刘恒涛** is a solutions architect at Amazon Web Services, responsible for consulting on and designing cloud computing architectures based on Amazon Web Services. He also works on the adoption and promotion of Amazon Web Services cloud services in China, and currently focuses on machine learning and serverless.
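
As a supplement to the chat examples in section 4, here is a minimal sketch of how the Open Assistant prompt could be assembled from a chat history, and how an echoed prompt could be removed from a response. The helper names `build_prompt` and `strip_prompt` are hypothetical and are not part of the SageMaker SDK or TGI; only the special tokens `<|prompter|>`, `<|endoftext|>`, and `<|assistant|>` come from the article. The sketch runs locally and does not need an endpoint.

```python
from typing import List, Tuple

# Special tokens of the Open Assistant chat format used in this article.
PROMPTER = "<|prompter|>"
ASSISTANT = "<|assistant|>"
END = "<|endoftext|>"


def build_prompt(message: str, chat_history: List[Tuple[str, str]]) -> str:
    """Join earlier (user, assistant) turns and a new user message into one prompt.

    Hypothetical helper for illustration only.
    """
    turns = "".join(
        f"{PROMPTER}{user}{END}{ASSISTANT}{bot}{END}" for user, bot in chat_history
    )
    # The trailing assistant token asks the model to produce the next reply.
    return f"{turns}{PROMPTER}{message}{END}{ASSISTANT}"


def strip_prompt(generated_text: str, prompt: str) -> str:
    """Remove the prompt only if the response actually starts with it.

    Hypothetical helper for illustration only.
    """
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    return generated_text


history = [("Hi", "Hello! How can I help?")]
prompt = build_prompt("Give me 3 tips.", history)
print(prompt)
```

The resulting string could be sent as the `inputs` field of the payload passed to `llm.predict`. Checking `startswith` before slicing keeps the reply intact whether the endpoint returns the full text or only the generated part.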