LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance. LightLLM harnesses the strengths of numerous well-regarded open-source implementations, including but not limited to FasterTransformer, TGI, vLLM, and FlashAttention.

Features

  • Tri-process asynchronous collaboration: tokenization, model inference, and detokenization are performed asynchronously, leading to a considerable improvement in GPU utilization.
  • Nopad (Unpad): offers support for nopad attention operations across multiple models to efficiently handle requests with large length disparities.
  • Dynamic Batch: enables dynamic batch scheduling of requests.
  • FlashAttention: incorporates FlashAttention to improve speed and reduce GPU memory footprint during inference.
  • Tensor Parallelism: utilizes tensor parallelism over multiple GPUs for faster inference.
  • Token Attention: implements a token-wise KV cache memory management mechanism, allowing for zero memory waste during inference.
  • High-performance Router: collaborates with Token Attention to meticulously manage the GPU memory of each token, thereby optimizing system throughput.
  • Int8KV Cache: increases token capacity to almost twice as much. Only LLaMA is supported at the moment.
Supported Model List

  • BLOOM
  • LLaMA
  • LLaMA V2
  • StarCoder
  • Qwen-7b
  • ChatGLM2-6b
  • Baichuan-7b
  • Baichuan2-7b
  • Baichuan2-13b
  • Baichuan-13b
  • InternLM-7b
  • InternVL-Chat
  • Yi-34b
  • Qwen-VL
  • Qwen-VL-Chat
  • Llava-7b
  • Llava-13b
  • Mixtral
  • Stablelm
  • MiniCPM
  • Phi-3
  • CohereForAI
  • DeepSeek-V2-Lite
  • DeepSeek-V2

    When you start Qwen-7b, you need to set the parameter '--eos_id 151643 --trust_remote_code'.

    ChatGLM2 needs to set the parameter '--trust_remote_code'.

    Baichuan and Baichuan2 need to set the parameter '--trust_remote_code'.

    InternLM needs to set the parameter '--trust_remote_code'.

    InternVL-Chat(Phi3) needs to set the parameter '--eos_id 32007 --trust_remote_code'.

    InternVL-Chat(InternLM2) needs to set the parameter '--eos_id 92542 --trust_remote_code'.

    Stablelm needs to set the parameter '--trust_remote_code'.

    Phi-3 only supports Mini and Small.

    DeepSeek-V2-Lite and DeepSeek-V2 need to set the parameter '--data_type bfloat16'.
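
    For example, a Qwen-7b launch could combine these flags with the usual server arguments described in the Get started section below (an illustrative sketch only; adjust the model path and max_total_token_num for your hardware):

    python -m lightllm.server.api_server --model_dir /path/Qwen-7B      \
                                         --host 0.0.0.0                 \
                                         --port 8080                    \
                                         --tp 1                         \
                                         --max_total_token_num 120000   \
                                         --eos_id 151643                \
                                         --trust_remote_code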

    Get started

    Requirements

    The code has been tested with PyTorch>=1.3, CUDA 11.8, and Python 3.9. To install the necessary dependencies, please refer to the provided requirements.txt and follow the instructions below:

    pip install -r requirements.txt

    Container

    You can use the official Docker container to run the model more easily. To do this, follow these steps:

    Pull the container from the GitHub Container Registry:

    docker pull ghcr.io/modeltc/lightllm:main

    Run the container with GPU support and port mapping:

    docker run -it --gpus all -p 8080:8080                  \
            --shm-size 1g -v your_local_path:/data/         \
            ghcr.io/modeltc/lightllm:main /bin/bash

    Alternatively, you can build the container yourself:

    docker build -t <image_name> .
    docker run -it --gpus all -p 8080:8080                  \
            --shm-size 1g -v your_local_path:/data/         \
            <image_name> /bin/bash

    You can also use a helper script to launch both the container and the server:

    python tools/quick_launch_docker.py --help

    Note: If you use multiple GPUs, you may need to increase the shared memory size by adding --shm-size to the docker run command.
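
    For instance, a multi-GPU run with a larger shared memory size might look like the following (a sketch only; the 4g value is an arbitrary illustration, so size it to your workload):

    docker run -it --gpus all -p 8080:8080                  \
            --shm-size 4g -v your_local_path:/data/         \
            ghcr.io/modeltc/lightllm:main /bin/bash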

    Installation

    Install from the source code by

    python setup.py install

    Install Triton Package

    The code has been tested on a range of GPUs including V100, A100, A800, 4090, and H800. If you are running the code on A100, A800, etc., we recommend using triton==2.1.0.

    pip install triton==2.1.0 --no-deps

    If you are running the code on H800 or V100, we recommend using triton-nightly. Note that triton-nightly has a significant CPU bottleneck, leading to high decode latency at low concurrency levels; see the related issue and fix PR. You can try modifying and compiling the source code yourself to resolve this issue.

    pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps

    RUN LLaMA

    With efficient Routers and TokenAttention, LightLLM can be deployed as a service and achieve state-of-the-art throughput performance.

    Launch the server:

    python -m lightllm.server.api_server --model_dir /path/llama-7B     \
                                         --host 0.0.0.0                 \
                                         --port 8080                    \
                                         --tp 1                         \
                                         --max_total_token_num 120000

    The parameter max_total_token_num is influenced by the GPU memory of the deployment environment. A larger value for this parameter allows for the processing of more concurrent requests, thereby increasing system concurrency. For more startup parameters, please refer to api_server.py or ApiServerArgs.md.
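
    As a rough illustration of how this value relates to GPU memory, here is a back-of-envelope sketch that assumes an 80 GB A800, LLaMA-7B weights in fp16, and the standard K/V cache layout; your numbers will differ:

    # Illustrative estimate only; all sizes below are assumptions, not measured values.
    gpu_mem_gb    = 80    # total GPU memory (A800 80G)
    weight_mem_gb = 14    # ~7B parameters in fp16
    reserved_gb   = 6     # assumed margin for activations and fragmentation

    # KV cache bytes per token for LLaMA-7B: 2 (K and V) * layers * heads * head_dim * 2 bytes (fp16)
    layers, heads, head_dim = 32, 32, 128
    kv_bytes_per_token = 2 * layers * heads * head_dim * 2   # 524288 bytes = 0.5 MB

    free_bytes = (gpu_mem_gb - weight_mem_gb - reserved_gb) * 1024**3
    print(free_bytes // kv_bytes_per_token)                  # 122880, on the order of the 120000 used above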

    To initiate a query in the shell:

    curl http://127.0.0.1:8080/generate     \
        -X POST                             \
        -d '{"inputs":"What is AI?","parameters":{"max_new_tokens":17, "frequency_penalty":1}}' \
        -H 'Content-Type: application/json'

    To query from Python:

    import time
    import requests
    import json
    url = 'http://localhost:8080/generate'
    headers = {'Content-Type': 'application/json'}
    data = {
        'inputs': 'What is AI?',
        "parameters": {
            'do_sample': False,
            'ignore_eos': False,
            'max_new_tokens': 1024,
        }
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        print(response.json())
    else:
        print('Error:', response.status_code, response.text)
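
    The server also exposes a streaming endpoint, /generate_stream (it appears, commented out, in the QWen-VL-Chat example below). A minimal consumption sketch with requests, assuming the server streams incremental chunks over the HTTP response (the exact chunk format is not documented here and may differ):

    import json
    import requests

    url = 'http://localhost:8080/generate_stream'   # streaming counterpart of /generate
    headers = {'Content-Type': 'application/json'}
    data = {
        'inputs': 'What is AI?',
        'parameters': {
            'do_sample': False,
            'ignore_eos': False,
            'max_new_tokens': 1024,
        }
    }
    with requests.post(url, headers=headers, data=json.dumps(data), stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:                              # skip keep-alive blank lines
                print(line.decode('utf-8'))       # print each raw chunk as it arrives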

    RUN Multimodal Models

    Run QWen-VL
    python -m lightllm.server.api_server \
        --host 0.0.0.0                 \
        --port 8080                    \
        --tp 1                         \
        --max_total_token_num 12000    \
        --trust_remote_code            \
        --enable_multimodal            \
        --cache_capacity 1000          \
        --model_dir /path/of/Qwen-VL or /path/of/Qwen-VL-Chat
    Run Llava
    python -m lightllm.server.api_server \
        --host 0.0.0.0                 \
        --port 8080                    \
        --tp 1                         \
        --max_total_token_num 12000    \
        --trust_remote_code            \
        --enable_multimodal            \
        --cache_capacity 1000          \
        --model_dir /path/of/llava-v1.5-7b or /path/of/llava-v1.5-13b
    Query From QWen-VL
    import time
    import requests
    import json
    import base64
    url = 'http://localhost:8080/generate'
    headers = {'Content-Type': 'application/json'}
    uri = "/local/path/of/image" # or "/http/path/of/image"
    if uri.startswith("http"):
        images = [{"type": "url", "data": uri}]
    else:
        with open(uri, 'rb') as fin:
            b64 = base64.b64encode(fin.read()).decode("utf-8")
        images=[{'type': "base64", "data": b64}]
    data = {
        "inputs": "<img></img>Generate the caption in English with grounding:",
        "parameters": {
            "max_new_tokens": 200,
            # The space before <|endoftext|> is important: the server removes the first bos_token_id, but the QWen tokenizer does not have a bos_token_id.
            "stop_sequences": [" <|endoftext|>"],
        },
        "multimodal_params": {
            "images": images,
        },
    }
    response = requests.post(url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        print(response.json())
    else:
        print('Error:', response.status_code, response.text)
    Query From QWen-VL-Chat
    import json
    import requests
    import base64
    def run_once(query, uris):
        images = []
        for uri in uris:
            if uri.startswith("http"):
                images.append({"type": "url", "data": uri})
            else:
                with open(uri, 'rb') as fin:
                    b64 = base64.b64encode(fin.read()).decode("utf-8")
                images.append({'type': "base64", "data": b64})
        data = {
            "inputs": query,
            "parameters": {
                "max_new_tokens": 200,
                # The space before <|endoftext|> is important: the server removes the first bos_token_id, but the QWen tokenizer does not have a bos_token_id.
                "stop_sequences": [" <|endoftext|>", " <|im_start|>", " <|im_end|>"],
            },
            "multimodal_params": {
                "images": images,
            },
        }
        # url = "http://127.0.0.1:8080/generate_stream"
        url = "http://127.0.0.1:8080/generate"
        headers = {'Content-Type': 'application/json'}
        response = requests.post(url, headers=headers, data=json.dumps(data))
        if response.status_code == 200:
            print(" + result: ({})".format(response.json()))
        else:
            print(' + error: {}, {}'.format(response.status_code, response.text))
    """
    multi-img, multi-round:

    <|im_start|>system
    You are a helpful assistant.<|im_end|>
    <|im_start|>user
    <img></img>
    <img></img>
    上面两张图片分别是哪两个城市?请对它们进行对比。<|im_end|>
    <|im_start|>assistant
    根据提供的信息,两张图片分别是重庆和北京。<|im_end|>
    <|im_start|>user
    这两座城市分别在什么地方?<|im_end|>
    <|im_start|>assistant
    """
    run_once(
        uris = [
            "assets/mm_tutorial/Chongqing.jpeg",
            "assets/mm_tutorial/Beijing.jpeg",
        ],
        query = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<img></img>\n<img></img>\n上面两张图片分别是哪两个城市?请对它们进行对比。<|im_end|>\n<|im_start|>assistant\n根据提供的信息,两张图片分别是重庆和北京。<|im_end|>\n<|im_start|>user\n这两座城市分别在什么地方?<|im_end|>\n<|im_start|>assistant\n",
    )
    Query From Llava
    import time
    import requests
    import json
    import base64
    url = 'http://localhost:8080/generate'
    headers = {'Content-Type': 'application/json'}
    uri = "/local/path/of/image" # or "/http/path/of/image"
    if uri.startswith("http"):
        images = [{"type": "url", "data": uri}]
    else:
        with open(uri, 'rb') as fin:
            b64 = base64.b64encode(fin.read()).decode("utf-8")
        images=[{'type': "base64", "data": b64}]
    data = {
        "inputs": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nPlease explain the picture. ASSISTANT:",
        "parameters": {
            "max_new_tokens": 200,
        "multimodal_params": {
            "images": images,
    response = requests.post(url, headers=headers, data=json.dumps(data))
    if response.status_code == 200:
        print(response.json())
    else:
        print('Error:', response.status_code, response.text)

    Additional launch parameters: --enable_multimodal, --cache_capacity; a larger --cache_capacity requires a larger --shm-size.

    --tp > 1 is supported; when tp > 1, the visual model runs on GPU 0.

    The special image tag for Qwen-VL is <img></img> (<image> for Llava). The length of data["multimodal_params"]["images"] should match the number of image tags; the number can be 0, 1, 2, ...

    Input image format: a list of dicts like {'type': 'url'/'base64', 'data': xxx}.
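
    For example, a request whose prompt contains two image tags must carry exactly two entries in "images" (a sketch assembled from the formats above; the URLs are placeholders):

    # Two <img></img> tags in the prompt, so "images" must contain exactly two entries.
    data = {
        "inputs": "<img></img>\n<img></img>\nCompare the two pictures above:",
        "parameters": {"max_new_tokens": 200},
        "multimodal_params": {
            "images": [
                {"type": "url", "data": "https://example.com/first.jpg"},    # placeholder
                {"type": "url", "data": "https://example.com/second.jpg"},   # placeholder
            ],
        },
    }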

    Performance

    Service Performance

    We compared the service performance of LightLLM and vLLM==0.1.2 on LLaMA-7B using an A800 with 80G GPU memory.

    To begin, prepare the data as follows:

    wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json

    Launch the service:

    python -m lightllm.server.api_server --model_dir /path/llama-7b --tp 1 --max_total_token_num 121060 --tokenizer_mode auto

    Evaluation:

    cd test
    python benchmark_serving.py --tokenizer /path/llama-7b --dataset /path/ShareGPT_V3_unfiltered_cleaned_split.json --num-prompts 2000 --request-rate 200

    The performance comparison results are presented below:

    [Throughput comparison table: vLLM vs. LightLLM]

    Static inference performance

    For debugging, we offer static performance testing scripts for various models. For instance, you can evaluate the inference performance of the LLaMA model by running:

    cd test/model
    python test_llama.py

    FAQ

  • If the LLaMA tokenizer fails to load, consider resolving this by running the command pip install protobuf==3.20.0.
  • If you hit the error "PTX .version 7.4 does not support .target sm_89", launch with bash tools/resolve_ptx_version python -m lightllm.server.api_server ...

    Projects using lightllm

    If you have a project that should be incorporated, please contact via email or create a pull request.

    LazyLLM: the easiest and laziest way to build multi-agent LLM applications.

    Once you have installed lightllm and lazyllm, you can use the following code to build your own chatbot:

    from lazyllm import TrainableModule, deploy, WebModule
    # The model will be downloaded automatically if you have an internet connection
    m = TrainableModule('internlm2-chat-7b').deploy_method(deploy.lightllm)
    WebModule(m).start().wait()

    Documents: https://lazyllm.readthedocs.io/

    Community

    For further information and discussion, join our discord server.

    License

    This repository is released under the Apache-2.0 license.

    Acknowledgement

    We learned a lot from the following projects when developing LightLLM.

  • Faster Transformer
  • Text Generation Inference
  • Flash Attention 1&2
  • OpenAI Triton