⛰️Valley: Video Assistant with Large Language model Enhanced abilitY

Understanding Complex Videos Relying on Large Language and Vision Models

[ Project Page ] [ Paper ] [ demo ]

The online demo is no longer available, since we have released the code for offline demo deployment.

Video Assistant with Large Language model Enhanced abilitY
Ruipu Luo* , Ziwang Zhao* , Min Yang* (*Equal Contribution)

Generated by stablecog via "A cute llama with valley"

Usage and License Notices: The data, code, and checkpoints are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of LLaMA, Vicuna, and GPT-4. The dataset is CC BY-NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.

Release

  • [8/14] 🔥 We released the Chinese version of Valley! Its 7B and 13B weights are now available at Chinese-Valley7B-V1 and Chinese-Valley13B-V1.
  • [8/10] 🔥 Released the pretrain-stage weights of the 7B and 13B models: Valley2-7b-pretrain and valley-13b-pretrain.
  • [8/8] 🔥 We released our self-collected and expanded instruction fine-tuning dataset (Valley-Instruct-73k).
  • [8/7] 🔥 We released Valley2-7b, which replaces Vicuna with Llama 2.
  • [7/23] 🫧 We modified our training code to make it easier to train Valley, and added support for LoRA training.
  • [7/5] 🫧 Released the training code for Valley and uploaded our pretraining data.
  • [6/21] 🫧 Uploaded the offline demo code.
  • [6/14] 🫧 Built a share link [ demo ].
  • [6/13] 🫧 We uploaded the model weights of Valley-13b-v1-delta.
  • [6/12] 🫧 We released Valley: Video Assistant with Large Language model Enhanced abilitY. Check out the paper.
  • Release inference code
  • Upload weight of Valley-v1 and build a share link demo
  • Upload offline demo code
  • Release 703k pretraining data and 73k instruction tuning data
  • Upload pretrain and tuning code
  • Upload weight of Valley2-7B and Valley-v3

Install

  • Clone this repository and navigate to the Valley folder
  • git clone https://github.com/RupertLuo/Valley.git
    cd Valley
    
  • Install the package
  • conda create -n valley python=3.10 -y
    conda activate valley
    pip install --upgrade pip 
    pip install -e .
    

    In the pretrain stage, we use data from LLaVA-CC3M-Pretrain-595K and from Valley-webvid2M-Pretrain-703K, which we collected and filtered ourselves. For acquiring the image and video data, please refer to LLaVA and WebVid.

    In the finetune stage, we use data from LLaVA-instruct-150K, VideoChat-instruct-11K, and our self-collected Valley-Instruct-73k. For the images and videos of the first two parts, please refer to their official websites. Below we describe how we obtained the data we collected ourselves (Valley-Instruct-73k).

  • Part of Valley-Instruct-73k is collected from the open-source dataset VATEX, which contains about 20k downloadable videos. You can download the original annotation file ("ava_vatex_training_v1.0.json") from its official website. Its videos come from YouTube, and there are now many open-source tools that can download YouTube videos by video id (see the sketch after this list). We provide a tool to download the videos; it is located in the Crawler folder, please read that tool's Readme.md to use it.
  • Another part of Valley-Instruct-73k is collected from a video site named JukinMedia, which contains a wide variety of videos. We also provide a tool to download JukinMedia videos and their high-quality descriptions; it is located in the Crawler folder, please read that tool's Readme.md to use it.
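
    For the YouTube part, a minimal sketch of the idea (not the bundled Crawler tool) using the open-source yt-dlp package is shown below; the video id and output path are placeholders.

    # Minimal sketch (not the official Crawler tool): fetch one VATEX clip by its
    # YouTube video id with the open-source yt-dlp package (pip install yt-dlp).
    from yt_dlp import YoutubeDL

    video_id = "YOUR_YOUTUBE_VIDEO_ID"  # placeholder: an id from the VATEX annotation file
    ydl_opts = {
        "format": "mp4",                     # prefer an mp4 download
        "outtmpl": "videos/%(id)s.%(ext)s",  # save as videos/<video_id>.mp4
    }
    with YoutubeDL(ydl_opts) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={video_id}"])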

ValleyWeight

    Valley 13b v1

    We release Valley-13b-v1 delta weights to comply with the LLaMA model license. You can apply these delta weights to the original LLaMA weights through the instructions below:

  • Get the original LLaMA weights in the Hugging Face format by following the instructions here.
  • Use the following script to get the Valley weights by applying our delta (13b-v1).
  • python3 valley/model/apply_delta.py \
        --base /path/to/llama-13b \
        --target /output/path/to/Valley-13B-v1 \
        --delta /path/to/valley-13b-v1-delta

    Valley2 7b

    For the Valley2-7b model, we provide the weights directly; the address is here.
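
    Because these are full weights rather than a delta, they should load directly with the same classes used in the inference example later in this README; a rough sketch (the local path is a placeholder) is:

    # Rough sketch: load the directly-released Valley2-7b weights (no delta to apply).
    # The path below is a placeholder for wherever you downloaded the weights.
    import torch
    from transformers import AutoTokenizer
    from valley.model.valley import ValleyLlamaForCausalLM

    model_path = "/path/to/Valley2-7b"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = ValleyLlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)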

    Chinese Valley 13b

    We now support Chinese Valley. We use "BelleGroup/BELLE-LLaMA-EXT-13B" as the LLM backbone and "OFA-Sys/chinese-clip-vit-large-patch14" as the visual backbone; the address is here.

    Pretrain Weight

    We provide 13B and 7B pre-trained weights so that people can fine-tune directly on them with their own fine-tuning data.

    Web UI

    The framework of this web UI comes from LLaVA and FastChat; we modified part of the code so that the demo supports video and image input.

    Launch a controller

    python valley/serve/controller.py
    

    Launch a model worker

    python valley/serve/model_worker.py --model-path /path/to/valley-13b-v1
    

    P.S.: At present, only single-GPU mode is supported for loading the model, and at least 30 GB of GPU memory is required, so the graphics card needs to be at least a Tesla V100.
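
    A quick way to check whether the visible GPU meets that memory requirement (plain PyTorch, not part of Valley) is:

    # Check that GPU 0 has roughly the ~30 GB needed to host the 13B worker on a single card.
    import torch

    assert torch.cuda.is_available(), "no CUDA device visible"
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024 ** 3
    print(f"GPU 0: {torch.cuda.get_device_name(0)}, {total_gb:.1f} GB")
    if total_gb < 30:
        print("Warning: less than 30 GB of GPU memory; the 13B model may not fit.")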

    Launch a gradio demo

    python valley/serve/gradio_web_server_video.py --share

    Inference Valley in Command Line

    We have now updated the inference code to be more convenient; it supports input in the OpenAI API format.

    Inference CLI

    python3 inference/run_valley.py --model-name [PATH TO VALLEY WEIGHT] --video_file [PATH TO VIDEO] --quary [YOUR QUERY ON THE VIDEO]
    

    Inference Chinese Valley

    python3 inference/run_valley.py --model-name [PATH TO CHINESE VALLEY WEIGHT] --video_file [PATH TO VIDEO] --quary [YOUR QUERY ON THE VIDEO] --system-prompt "你是大型语言视觉助手 Chinese-Valley。你能够理解用户提供的视觉内容或视频,并使用自然语言协助用户完成各种任务。请仔细按照人类的指令进行回答,并详细解释你的答案。"
    

    Inference in Code

    import torch
    from transformers import AutoTokenizer
    from valley.model.valley import ValleyLlamaForCausalLM
    # DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, DEFAULT_VI_START_TOKEN,
    # DEFAULT_VI_END_TOKEN, DEFAULT_VIDEO_FRAME_TOKEN and DEFAULT_IMAGE_PATCH_TOKEN
    # are the special-token constants defined in the Valley codebase.

    def init_vision_token(model, tokenizer):
        # Register the ids of the image/video special tokens on the vision tower config.
        vision_config = model.get_model().vision_tower.config
        vision_config.im_start_token, vision_config.im_end_token = tokenizer.convert_tokens_to_ids([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN])
        vision_config.vi_start_token, vision_config.vi_end_token = tokenizer.convert_tokens_to_ids([DEFAULT_VI_START_TOKEN, DEFAULT_VI_END_TOKEN])
        vision_config.vi_frame_token = tokenizer.convert_tokens_to_ids(DEFAULT_VIDEO_FRAME_TOKEN)
        vision_config.im_patch_token = tokenizer.convert_tokens_to_ids([DEFAULT_IMAGE_PATCH_TOKEN])[0]

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    # input the query
    query = "Describe the video concisely."
    # input the system prompt
    system_prompt = "You are Valley, a large language and vision assistant trained by ByteDance. You are able to understand the visual content or video that the user provides, and assist the user with a variety of tasks using natural language. Follow the instructions carefully and explain your answers in detail."
    model_path = "THE MODEL PATH"    # path to the Valley weights
    video_file = "THE VIDEO PATH"    # path to the video to describe
    # load the model and tokenizer, then register the vision tokens
    model = ValleyLlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    init_vision_token(model, tokenizer)
    model = model.to(device)
    model.eval()
    # we support OpenAI-format input
    message = [ {"role": "system", "content": system_prompt},
                {"role": "user", "content": 'Hi!'},
                {"role": "assistant", "content": 'Hi there! How can I help you today?'},
                {"role": "user", "content": query}]
    # generation hyperparameters
    gen_kwargs = dict(
        do_sample=True,
        temperature=0.2,
        max_new_tokens=1024,
    )
    response = model.completion(tokenizer, video_file, message, gen_kwargs, device)
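
    Assuming completion returns the decoded reply text, the result can then be printed or post-processed directly:

    # response is expected to hold the model's reply to the last user turn
    print(response)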

    Train Valley Step By Step

    Inspired by LLaVA, we adopt a two-stage training method. The pre-training stage uses Valley-webvid2M-Pretrain-703K and LLaVA-CC3M-Pretrain-595K, and the fine-tuning stage uses LLaVA-instruct-150K, VideoChat-instruct-11K, and Valley-Instruct-73k.

    We modified our code for training Valley and manage the model hyperparameters with YAML files. Run the following two scripts to perform Valley training.

    Pretrain

    The LLM backbones that currently support pre-training are LLaMA (7B, 13B), Vicuna (7B, 13B), StableVicuna (13B), and Llama 2 (chat-7B, chat-13B). You need to download these open-source language model weights yourself and convert them to the Hugging Face format.
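
    For backbones that are already published on the Hugging Face Hub (for example Vicuna or Llama 2 chat), one way to fetch them is with huggingface_hub; the repository id below is an example, not a requirement:

    # Example only: download a backbone checkpoint that is already in Hugging Face format.
    # Gated repositories such as Llama 2 require accepting the license and logging in first.
    from huggingface_hub import snapshot_download

    local_dir = snapshot_download(repo_id="lmsys/vicuna-13b-v1.3")  # example backbone
    print("Backbone weights downloaded to:", local_dir)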

    bash valley/train/train.sh valley/configs/experiment/valley_stage1.yaml

    Finetune

    bash valley/train/train.sh valley/configs/experiment/valley_stage2.yaml

    Acknowledgement

  • LLaVA & MOSS: Thanks to these two repositories for providing high-quality code, our code is based on them.

Citation

    If this project is helpful to your research, please consider citing our paper as follows:

    @misc{luo2023valley,
          title={Valley: Video Assistant with Large Language model Enhanced abilitY}, 
          author={Ruipu Luo and Ziwang Zhao and Min Yang and Junwei Dong and Minghui Qiu and Pengcheng Lu and Tao Wang and Zhongyu Wei},
          year={2023},
          eprint={2306.07207},
          archivePrefix={arXiv},
          primaryClass={cs.CV}
    }