MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

University of Maryland, College Park   Meta   University of Central Florida
MA-LMM's long-term memory bank auto-regressively stores and accumulates past video information.

GPU memory consumption of existing multimodal methods and MA-LMM during inference. Circle sizes represent the number of input text tokens.

Abstract

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has gained increasing interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames, restricting them to short-video understanding.

In this study, we focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously, as most existing work does, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding the LLM's context length constraints or GPU memory limits.
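As a rough illustration of this online scheme, the sketch below shows how frames could be encoded one at a time while past information accumulates in memory banks. It is PyTorch-style pseudocode rather than the released implementation; visual_encoder, q_former, and llm_project are placeholder callables, and the tensor shapes are assumptions.

import torch

def process_video_online(frames, visual_encoder, q_former, llm_project):
    """Sketch of online processing: one frame per step, with visual features
    and Q-Former queries accumulated in two memory banks."""
    visual_bank = []   # raw visual features, one [num_patches, dim] tensor per timestep
    query_bank = []    # learned queries produced by the Q-Former at each timestep
    queries = None
    for frame in frames:                              # frame: [3, H, W] tensor
        feat = visual_encoder(frame.unsqueeze(0))     # -> [1, num_patches, dim]
        visual_bank.append(feat.squeeze(0))
        # The Q-Former cross-attends to everything stored so far, so the current
        # step can reference arbitrarily old video content.
        queries = q_former(
            visual_memory=torch.stack(visual_bank),                        # [t, num_patches, dim]
            query_memory=torch.stack(query_bank) if query_bank else None,  # [t-1, num_queries, dim]
        )
        query_bank.append(queries.squeeze(0))
    # Only the final, fixed-size set of queries reaches the (frozen) LLM, so the
    # number of text tokens does not grow with video length.
    return llm_project(queries)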

Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-term video understanding, video question answering, and video captioning, and our model achieves state-of-the-art performance across multiple datasets.

Architecture

(a) Framework overview. MA-LMM auto-regressively processes video frames in an online manner. Two long-term memory banks store the raw visual features and the learned queries at each timestep for future reference. The Q-Former is composed of several cascaded blocks, indexed by l. The LLM outputs text for various downstream video understanding tasks. The snowflake icon indicates components with frozen parameters, while the flame icon denotes fine-tuned components. (b) Illustration of the memory bank compression technique, which keeps the memory bank length constant.
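To illustrate the compression step in (b), below is a minimal sketch of similarity-based merging, assuming that whenever the bank grows past its fixed length, the two most similar adjacent timesteps are averaged into one slot. This is illustrative PyTorch, not the authors' released code, and the tensor shapes are assumptions.

import torch
import torch.nn.functional as F

def compress_memory_bank(bank: torch.Tensor, max_len: int) -> torch.Tensor:
    """Keep a memory bank at a fixed temporal length by merging adjacent timesteps.

    bank: [t, num_tokens, dim] features accumulated over time.
    While t exceeds max_len, the two most similar adjacent timesteps are averaged,
    compressing temporally redundant content while preserving the overall timeline.
    """
    while bank.size(0) > max_len:
        # Cosine similarity between each pair of adjacent timesteps, averaged over tokens.
        sim = F.cosine_similarity(bank[:-1], bank[1:], dim=-1).mean(dim=-1)  # [t - 1]
        i = int(sim.argmax())                    # index of the most redundant adjacent pair
        merged = (bank[i] + bank[i + 1]) / 2     # average the pair into a single slot
        bank = torch.cat([bank[:i], merged.unsqueeze(0), bank[i + 2:]], dim=0)
    return bank

For example, compress_memory_bank(torch.randn(21, 32, 768), max_len=20) merges one adjacent pair and returns a [20, 32, 768] tensor; applied after each new frame, it keeps both memory banks at a constant size.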

Results

Long-term Video Understanding

Long-term video understanding results on the LVU, Breakfast, and COIN datasets.

Video Question Answering

Video question answering results on the MSRVTT, MSVD, and ActivityNet datasets.

Visualization

Video Question Answering

Visualization results on the video question answering task under the online, off-the-shelf setting.

Video Captioning

Visualization results on the video captioning task under the online, off-the-shelf setting.

BibTeX

@inproceedings{he2024malmm,
  author    = {He, Bo and Li, Hengduo and Jang, Young Kyun and Jia, Menglin and Cao, Xuefei and Shah, Ashish and Shrivastava, Abhinav and Lim, Ser-Nam},
  title     = {MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding},
  booktitle = {CVPR},
  year      = {2024},
}