
HiLight: Video Language Model with Dual Vision Encoders

HiLight model architecture diagram

[2407.07325] HiLight: Technical Report on the Motern AI Video Language Model (arxiv.org)

HiLight is a video-language model whose dual-tower visual extractor comprises a CLIP-ViP video feature encoder and a Long-CLIP image feature encoder. The extracted features are fused by the TokenMining module and then passed to the Gemma-2B language model for generation.
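
The sketch below summarizes this dataflow. It is a conceptual illustration only: the class and attribute names, the dimensions, and the cross-attention-style fusion are hypothetical stand-ins, not the actual TokenMining implementation in hilight/.

import torch
import torch.nn as nn

class TokenMiningSketch(nn.Module):
    # Hypothetical fusion module: merges video-level (CLIP-ViP) and frame-level
    # (Long-CLIP) features into a fixed number of visual tokens for the LLM.
    def __init__(self, vip_dim=512, clip_dim=512, llm_dim=2048, num_tokens=32):
        super().__init__()
        self.query = nn.Parameter(torch.randn(num_tokens, llm_dim))
        self.proj = nn.Linear(vip_dim + clip_dim, llm_dim)
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)

    def forward(self, vip_feats, clip_feats):
        # vip_feats, clip_feats: (batch, num_frames, dim) outputs of the two encoders
        fused = self.proj(torch.cat([vip_feats, clip_feats], dim=-1))
        queries = self.query.unsqueeze(0).expand(fused.size(0), -1, -1)
        visual_tokens, _ = self.attn(queries, fused, fused)
        return visual_tokens  # (batch, num_tokens, llm_dim)

vip_feats = torch.randn(1, 12, 512)   # stand-in for CLIP-ViP video features
clip_feats = torch.randn(1, 12, 512)  # stand-in for Long-CLIP image features
visual_tokens = TokenMiningSketch()(vip_feats, clip_feats)
print(visual_tokens.shape)  # (1, 32, 2048); such tokens are fed to Gemma-2B alongside the text embeddings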

AutoDL Deployment (Recommended)

1. Load our public image on the AutoDL cloud server:

https://www.codewithgpu.com/i/motern88/HiLight/HiLight_VideoLanguageModel

2. Make sure the model weights exist in /root/autodl-tmp/ .

Since the AutoDL system disk is only 30GB and each of our model weights is approximately 9GB, the image does not include the weights. You therefore need to store the downloaded weights on the AutoDL data disk (/root/autodl-tmp).

/root/autodl-tmp
├──HiLight-2B-LoRA-Merge-stage2-M101K-2epoch
├──HiLight-2B-LoRA-Merge-stage2-VT1K-2epoch
├──HiLight-2B-LoRA-Merge-stage2-tune(TM+LLM)M101K-3epoch
├──HiLight-2B-LoRA-Merge-stage2-tune(TM+LLM)VT1K-2epoch

Please download the model weights from Hugging Face (ATA-space/HiLight) and place them in the correct path.
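
If you prefer to script the download, huggingface_hub's snapshot_download can fetch the checkpoints straight onto the data disk. A minimal sketch (the allow_patterns filter is optional and only needed if you want a single checkpoint):

from huggingface_hub import snapshot_download

# Download the HiLight checkpoints from Hugging Face onto the AutoDL data disk.
snapshot_download(
    repo_id="ATA-space/HiLight",
    local_dir="/root/autodl-tmp",
    # allow_patterns="HiLight-2B-LoRA-Merge-stage2-VT1K-2epoch/*",  # uncomment to fetch just one checkpoint
)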

After that, you can run webui_demo.py.

Local Deployment

1. Enter the root directory and install the environment.

  • PyTorch 2.1.0

  • Python 3.10 (Ubuntu 22.04)

  • CUDA 12.1

pip install -r requirements.txt

2. Make sure the model weights exist at the following paths.

HiLight
├──data
├──hilight
├──scripts
├──model_zoo
│   ├──CLIP-ViP
│   │   ├──clipvip-vision_config.json
│   │   ├──pretrain_clipvip_base_16.pt
│   ├──LLM
│   │   ├──gemma
│   │   │   ├──tokenizer
│   ├──Long-CLIP
│   │   ├──longclip-B.pt
│   ├──TokenMining
│   │   ├──[tokenmining.bin (the stage 1 training output) is only needed for stage 2 training]
├──work_dirs
│   ├──HiLight-2B-LoRA-Merge-stage2-M101K-2epoch
│   ├──HiLight-2B-LoRA-Merge-stage2-VT1K-2epoch
│   ├──HiLight-2B-LoRA-Merge-stage2-tune(TM+LLM)M101K-3epoch
│   ├──HiLight-2B-LoRA-Merge-stage2-tune(TM+LLM)VT1K-2epoch

Please download the model weights from Hugging Face (ATA-space/HiLight) and place them in the correct paths.
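
As a quick sanity check before running inference or training, the sketch below verifies that the files from the layout above are present (the path list mirrors the tree; trim it if you only downloaded some checkpoints):

from pathlib import Path

# Paths taken from the directory layout above; run this from the HiLight repository root.
REQUIRED = [
    "model_zoo/CLIP-ViP/clipvip-vision_config.json",
    "model_zoo/CLIP-ViP/pretrain_clipvip_base_16.pt",
    "model_zoo/Long-CLIP/longclip-B.pt",
    "model_zoo/LLM/gemma/tokenizer",
    "work_dirs/HiLight-2B-LoRA-Merge-stage2-VT1K-2epoch",
]

missing = [p for p in REQUIRED if not Path(p).exists()]
print("all weights found" if not missing else f"missing: {missing}")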

WebUI Inference (Recommended)

python webui_demo.py

CLI Inference

We did not continue debugging the predict_cli.py script, so you may need to modify the weight path in the code yourself.

python predict_cli.py --videos_file <path to your video>

Train

The training process consists of two stages: (1) a feature alignment stage, which bridges the vision and language tokens; and (2) an instruction tuning stage, which teaches the model to follow multimodal instructions.
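
In terms of trainable parameters, the two stages can be pictured as below. This is an illustrative sketch using peft; the attribute names (token_mining, llm) and the LoRA hyperparameters are hypothetical, not those used by the actual training scripts.

from peft import LoraConfig, get_peft_model

def setup_stage1(model):
    # Stage 1: freeze the vision encoders and the LLM, train only TokenMining.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.token_mining.parameters():
        p.requires_grad = True
    return model

def setup_stage2(model, tune_token_mining=False):
    # Stage 2: add LoRA adapters to the LLM; optionally keep tuning TokenMining.
    lora_cfg = LoraConfig(
        r=16, lora_alpha=32, lora_dropout=0.05,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # typical attention projections
    )
    model.llm = get_peft_model(model.llm, lora_cfg)
    for p in model.token_mining.parameters():
        p.requires_grad = tune_token_mining
    return model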

1. Grant execute permission to the training scripts.

chmod +x scripts/gemma/train/ds_HiLight_gemma_train_GPUs_stage2.sh
chmod +x scripts/gemma/train/ds_HiLight_gemma_train_GPUs.sh

2. Set the NCCL environment variables.

export NCCL_P2P_DISABLE=1
export NCCL_IB_DISABLE=1

3. Configure the .sh script parameters.

# in scripts/gemma/train/
--data_path      # path to the training data JSON
--val_path       # path to the validation data JSON
--video_folder   # folder of training videos (.pkl)
--val_folder     # folder of validation videos (.pkl)

--output_dir     # directory where trained weights are written
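
Annotations are plain JSON files, while the videos are expected as preprocessed .pkl files in the folders above. If you are unsure whether your preprocessed files deserialize correctly, a minimal check (making no assumption about the exact schema, which is defined by the training scripts) is:

import pickle
from pathlib import Path

# Replace with one of the files in your --video_folder (hypothetical path).
pkl_path = Path("data/videos/example.pkl")
with pkl_path.open("rb") as f:
    sample = pickle.load(f)

print(type(sample))
if isinstance(sample, dict):
    print(list(sample.keys()))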

4. Run DeepSpeed on a single machine with multiple GPUs.

Stage 1: train only TokenMining.

scripts/gemma/train/ds_HiLight_gemma_train_GPUs.sh

Stage 2: train the LLM with LoRA (optionally also tuning TokenMining).

scripts/gemma/train/ds_HiLight_gemma_train_GPUs_stage2.sh

