Bo He (何博)

I am currently a research scientist at Meta. I graduated from the Department of Computer Science at the University of Maryland, College Park, where I was advised by Prof. Abhinav Shrivastava. I obtained my Bachelor's degree from the University of Chinese Academy of Sciences, China, in 2018.

I'm interested in video-related tasks, especially video understanding, video compression, and multimodal learning.

Email: bohe [at] umd [dot] edu

CV / GitHub / Google Scholar / LinkedIn


  • [2024-03] Two papers are accepted to CVPR 2024.
  • [2024-02] I joined Meta as a research scientist.
  • [2023-07] One paper is accepted to ICCV 2023.
  • [2023-05] I started an internship at Meta, working on large language models for video understanding tasks.
  • [2023-03] Two papers are accepted to CVPR 2023.
  • [2022-09] One paper is accepted to BMVC 2022.
  • [2022-07] One paper is accepted to ECCV 2022.
  • [2022-05] I started an internship at Adobe, working on multimodal summarization, supervised by Zhaowen Wang and Trung Bui.
  • [2022-03] One paper is accepted to CVPR 2022.

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Abhinav Shrivastava, Ser-Nam Lim
CVPR, 2024
project page / arxiv / code

We propose a memory-augmented large multimodal model for efficient and effective long-term video understanding. Our model achieves state-of-the-art performance across multiple tasks, including long-video understanding, video question answering, and video captioning.

OmniViD: A Generative Framework for Universal Video Understanding
Junke Wang, Dongdong Chen, Chong Luo, Bo He, Lu Yuan, Zuxuan Wu, Yu-Gang Jiang
CVPR, 2024
arxiv / code

We seek to unify the output space of video understanding tasks by using language as labels and additionally introducing time and box tokens. This enables us to address various types of video tasks, including classification (such as action recognition), captioning (covering clip captioning, video question answering, and dense video captioning), and localization (such as visual object tracking), within a fully shared encoder-decoder architecture that follows a generative framework.

To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang
arxiv / code

We introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks.

Chop & Learn: Recognizing and Generating Object-State Compositions
Nirat Saini*, Hanyu Wang*, Archana Swaminathan, Vinoj Jayasundara, Kamal Gupta, Bo He, Abhinav Shrivastava
ICCV, 2023
project page / arxiv / code

We focus on the task of cutting objects in different styles and the resulting object state changes. We propose a new benchmark suite, Chop & Learn, to accommodate the needs of learning objects and different cut styles using multiple viewpoints.

Towards Scalable Neural Representation for Diverse Videos
Bo He, Xitong Yang, Hanyu Wang, Zuxuan Wu, Hao Chen, Shuaiyi Huang, Yixuan Ren, Ser-Nam Lim, Abhinav Shrivastava
CVPR, 2023
project page / arxiv / code

We propose D-NeRV, a novel framework based on implicit neural representations and designed to encode large-scale, diverse videos. It achieves state-of-the-art performance on video compression.

Align and Attend: Multimodal Summarization with Dual Contrastive Losses
Bo He, Jun Wang, Jielin Qiu, Trung Bui, Abhinav Shrivastava, Zhaowen Wang
CVPR, 2023
project page / arxiv / code

We propose A2Summ, a novel supervised multimodal summarization framework that summarizes video frames and text sentences with time correspondence. We also collect a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed texts with annotated summaries.

CNeRV: Content-adaptive Neural Representation for Visual Data
Hao Chen, Matt Gwilliam, Bo He, Ser-Nam Lim, Abhinav Shrivastava
BMVC, 2022 (oral)
project page / arxiv

We propose a neural visual representation with content-adaptive embedding, which combines the generalizability of autoencoders with the simplicity and compactness of implicit representations. We match the performance of NeRV, a state-of-the-art implicit neural representation, on the reconstruction task for frames seen during training, while far surpassing it on unseen frames that are skipped during training.

Learning Semantic Correspondence with Sparse Annotations
Shuaiyi Huang, Luyu Yang, Bo He, Songyang Zhang, Xuming He, Abhinav Shrivastava
ECCV, 2022
project page / arxiv / code

We address the challenge of label sparsity in semantic correspondence by enriching supervision signals from sparse keypoint annotations. We first propose a teacher-student learning paradigm for generating dense pseudo-labels and then develop two novel strategies for denoising pseudo-labels. Our approach establishes a new state of the art on three challenging benchmarks for semantic correspondence.

ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization
Bo He, Xitong Yang, Le Kang, Zhiyu Cheng, Xin Zhou, Abhinav Shrivastava
CVPR, 2022
project page / arxiv / code

We propose ASM-Loc, a novel weakly supervised temporal action localization framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. We establish a new state of the art on the THUMOS-14 and ActivityNet-v1.3 datasets.

Recognizing Actions using Object States
Nirat Saini, Bo He, Gaurav Shrivastava, Sai Saketh Rambhatla, Abhinav Shrivastava
ICLR Workshop, 2022

We propose a computational framework that uses only two object states, start and end, and learns to recognize the underlying actions.

NeRV: Neural Representations for Videos
Hao Chen, Bo He, Hanyu Wang, Yixuan Ren, Ser-Nam Lim, Abhinav Shrivastava
NeurIPS, 2021
project page / arxiv / code

We propose a novel image-wise neural representation (NeRV) that encodes videos in neural networks: it takes a frame index as input and outputs the corresponding RGB image. Compared to pixel-wise neural representations, NeRV improves encoding speed by 25× to 70× and decoding speed by 38× to 132×. It also shows comparable performance on visual compression and denoising tasks.

GTA: Global Temporal Attention for Video Action Understanding
Bo He, Xitong Yang, Zuxuan Wu, Hao Chen, Ser-Nam Lim, Abhinav Shrivastava
BMVC, 2021

We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. We apply GTA on both pixels and semantically similar regions to capture temporal relationships at different levels of spatial granularity.

  • Program Committee/Reviewers: CVPR, ICCV, ECCV, AAAI, NeurIPS, TPAMI

Thanks to Dr. Jon Barron for sharing the source code of his personal page.
