
🚀 [Release Notes] 2024.10 #116

@lixin4ever

Although we have encountered several unexpected difficulties (such as limited computing resources and manpower) over the past few months, we are constantly maintaining this repo and working to deliver new things to the community. In this release (2024.10), we provide two new models:

  1. VideoLLaMA2.1-7B-16F
    • Supercharged VideoLLaMA2 with SigLIP (vision encoder) and Qwen2 (language backbone)
    • Trained on more textual data (largely from Magpie and ALLaVA) to enhance the instruction-following capability
    • Improved results on almost all benchmarks (see the table below; a minimal inference sketch follows this list)
| Model | Egoschema | Perception-Test | MVBench | VideoMME | MSVC (Caption) | ActivityNet-QA |
| --- | --- | --- | --- | --- | --- | --- |
| VideoLLaMA2-7B-16F | 51.7 | 51.4 | 54.6 | 47.9/50.3 | 2.53/2.59 | 50.2/3.3 |
| VideoLLaMA2.1-7B-16F | 53.1 | 54.9 | 57.3 | 54.9/56.4 | 2.87/2.81 | 53.0/3.4 |
  2. VideoLLaMA2.1-7B-AV
    • Trained from VideoLLaMA2.1-7B-16F
    • Included more audio-visual joint training data (from AVInstruct) and more pure-text data
    • Improved training recipes (e.g., we found that smaller batch sizes in audio-related training consistently give better results; see the configuration sketch after this list)
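
For reference, here is a minimal inference sketch for the new video model, written in the style of the `model_init`/`mm_infer` interface exposed by this repo. The video path and prompt are placeholders, and the exact signatures should be verified against the current README.

```python
# Minimal sketch, assuming the videollama2 package's model_init / mm_infer
# interface and the Hugging Face checkpoint name below; verify both against
# the current README before use.
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

disable_torch_init()

model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F'
model, processor, tokenizer = model_init(model_path)

modal = 'video'
modal_path = 'assets/sample_video.mp4'             # placeholder path
instruct = 'Describe what happens in this video.'  # placeholder prompt

output = mm_infer(
    processor[modal](modal_path),
    instruct,
    model=model,
    tokenizer=tokenizer,
    do_sample=False,
    modal=modal,
)
print(output)
```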
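
To make the batch-size observation concrete, below is an illustrative configuration in Hugging Face `TrainingArguments` style. The repo's actual training scripts are launched differently, and all values here are hypothetical rather than the recipe used for this release.

```python
# Illustrative only: exact flag names and values in the repo's training
# scripts may differ. The point shown is keeping the effective batch size
# small for the audio-related training stage.
from transformers import TrainingArguments

audio_stage_args = TrainingArguments(
    output_dir="work_dirs/videollama2.1_av_audio_stage",  # placeholder path
    per_device_train_batch_size=4,   # smaller than a typical visual-stage value
    gradient_accumulation_steps=1,   # keep the *effective* batch size small too
    learning_rate=2e-5,              # placeholder; not taken from the release notes
    num_train_epochs=1,
    bf16=True,
)
```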
