
🚀 [Release Notes] 2024.10 #116

@lixin4ever

Although we have encountered several unexpected difficulties (such as limited computing resources and manpower) over the past few months, we are constantly maintaining this repo and working to deliver new things to the community. In this release (2024.10), we provide two new models:

  1. VideoLLaMA2.1-7B-16F
    • Supercharged VideoLLaMA2 with SigLIP (vision encoder) and Qwen2 (language backbone)
    • Trained on more textual data (largely from Magpie and ALLaVA) to enhance the instruction-following capability
    • Improved results on almost all benchmarks (see the table below; a minimal inference sketch follows this list)
| Model | Egoschema | Perception-Test | MVBench | VideoMME | MSVC (Caption) | ActivityNet-QA |
| --- | --- | --- | --- | --- | --- | --- |
| VideoLLaMA2-7B-16F | 51.7 | 51.4 | 54.6 | 47.9/50.3 | 2.53/2.59 | 50.2/3.3 |
| VideoLLaMA2.1-7B-16F | 53.1 | 54.9 | 57.3 | 54.9/56.4 | 2.87/2.81 | 53.0/3.4 |
  2. VideoLLaMA2.1-7B-AV
    • Trained from VideoLLaMA2.1-7B-16F
    • Included more audio-visual joint training data (from AVInstruct) and more pure-text data
    • Improved training recipes (e.g., we found that smaller batch sizes in audio-related training consistently give better results; see the configuration sketch after this list)
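
For reference, here is a minimal inference sketch for the new video model, written in the style of the `model_init`/`mm_infer` interface exposed by this repo. The video path and prompt are placeholders, and the exact signatures should be verified against the current README.

```python
# Minimal sketch, assuming the videollama2 package's model_init / mm_infer
# interface and the Hugging Face checkpoint name below; verify both against
# the current README before use.
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

disable_torch_init()

model_path = 'DAMO-NLP-SG/VideoLLaMA2.1-7B-16F'
model, processor, tokenizer = model_init(model_path)

modal = 'video'
modal_path = 'assets/sample_video.mp4'             # placeholder path
instruct = 'Describe what happens in this video.'  # placeholder prompt

output = mm_infer(
    processor[modal](modal_path),
    instruct,
    model=model,
    tokenizer=tokenizer,
    do_sample=False,
    modal=modal,
)
print(output)
```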
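
To make the batch-size observation concrete, below is an illustrative configuration in Hugging Face `TrainingArguments` style. The repo's actual training scripts are launched differently, and all values here are hypothetical rather than the recipe used for this release.

```python
# Illustrative only: exact flag names and values in the repo's training
# scripts may differ. The point shown is keeping the effective batch size
# small for the audio-related training stage.
from transformers import TrainingArguments

audio_stage_args = TrainingArguments(
    output_dir="work_dirs/videollama2.1_av_audio_stage",  # placeholder path
    per_device_train_batch_size=4,   # smaller than a typical visual-stage value
    gradient_accumulation_steps=1,   # keep the *effective* batch size small too
    learning_rate=2e-5,              # placeholder; not taken from the release notes
    num_train_epochs=1,
    bf16=True,
)
```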
