Bo He (何博)
I am currently a research scientist at Meta. I graduated from the Department of Computer Science at the University of Maryland, College Park, where I was advised by Prof. Abhinav Shrivastava. I obtained my Bachelor's degree from the University of Chinese Academy of Sciences, China, in 2018.
I'm interested in video-related tasks, especially video understanding, video compression, and multimodal learning.
Email: bohe [at] umd [dot] edu
CV  / 
Github  / 
Google Scholar  / 
LinkedIn
News
- [2023-07] One paper is accepted to ICCV 2023.
- [2023-05] I start an internship at Meta working on large language models for video understanding tasks.
- [2023-03] Two papers are accepted to CVPR 2023.
- [2022-09] One paper is accepted to BMVC 2022.
- [2022-07] One paper is accepted to ECCV 2022.
- [2022-05] I start an internship at Adobe working on multimodal summarization, supervised by Zhaowen Wang and Trung Bui.
- [2022-03] One paper is accepted to CVPR 2022.
Publications
Chop & Learn: Recognizing and Generating Object-State Compositions
Nirat Saini*,
Hanyu Wang*,
Archana Swaminathan,
Vinoj Jayasundara,
Kamal Gupta,
Bo He,
Abhinav Shrivastava
ICCV, 2023
project page /
arxiv /
code
We focus on the task of cutting objects in different styles and the resulting object state changes. We propose a new benchmark suite, Chop & Learn, to accommodate the needs of learning objects and different cut styles using multiple viewpoints.
Towards Scalable Neural Representation for Diverse Videos
Bo He,
Xitong Yang,
Hanyu Wang,
Zuxuan Wu,
Hao Chen,
Shuaiyi Huang,
Yixuan Ren,
Ser-Nam Lim,
Abhinav Shrivastava
CVPR, 2023
project page /
arxiv /
code
We propose D-NeRV, a novel implicit neural representation-based framework designed to encode large-scale and diverse videos. It achieves state-of-the-art performance on the video compression task.
Align and Attend: Multimodal Summarization with Dual Contrastive Losses
Bo He,
Jun Wang,
Jielin Qiu,
Trung Bui,
Abhinav Shrivastava,
Zhaowen Wang
CVPR, 2023
project page /
arxiv /
code
We propose A2Summ, a novel supervised multimodal summarization framework that summarizes video frames and text sentences with time correspondence. We also collect a large-scale multimodal summarization dataset, BLiSS, which contains livestream videos and transcribed texts with annotated summaries.
CNeRV: Content-adaptive Neural Representation for Visual Data
Hao Chen,
Matt Gwilliam,
Bo He,
Ser-Nam Lim,
Abhinav Shrivastava
BMVC, 2022 (oral)
project page /
arxiv
We propose a neural visual representation with content-adaptive embedding, which combines the generalizability of autoencoders with the simplicity and compactness of implicit representation. We match the performance of NeRV, a state-of-the-art implicit neural representation, on the reconstruction task for frames seen during training, while far surpassing it for unseen frames that are skipped during training.
Learning Semantic Correspondence with Sparse Annotations
Shuaiyi Huang,
Luyu Yang,
Bo He,
Songyang Zhang,
Xuming He,
Abhinav Shrivastava
ECCV, 2022
project page /
arxiv /
code
We address the challenge of label sparsity in semantic correspondence by enriching supervision signals from sparse keypoint annotations. We first propose a teacher-student learning paradigm for generating dense pseudo-labels and then develop two novel strategies for denoising pseudo-labels. Our approach establishes a new state of the art on three challenging benchmarks for semantic correspondence.
ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization
Bo He,
Xitong Yang,
Le Kang,
Zhiyu Cheng,
Xin Zhou,
Abhinav Shrivastava
CVPR, 2022
project page /
arxiv /
code
We propose ASM-Loc, a novel weakly supervised temporal action localization framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. We establish a new state of the art on the THUMOS-14 and ActivityNet-v1.3 datasets.
Recognizing Actions using Object States
Nirat Saini,
Bo He,
Gaurav Shrivastava,
Sai Saketh Rambhatla,
Abhinav Shrivastava
ICLR Workshop, 2022
arxiv
We propose a computational framework that uses only two object states, start and end, and learns to recognize the underlying actions.
NeRV: Neural Representations for Videos
Hao Chen,
Bo He,
Hanyu Wang,
Yixuan Ren,
Ser-Nam Lim,
Abhinav Shrivastava
NeurIPS, 2021
project page /
arxiv /
code
We propose a novel image-wise neural representation (NeRV) that encodes videos in neural networks, taking a frame index as input and outputting the corresponding RGB image. Compared to pixel-wise implicit representations, NeRV improves encoding speed by 25× to 70× and decoding speed by 38× to 132×, while showing comparable performance on video compression and denoising tasks.
GTA: Global Temporal Attention for Video Action Understanding
Bo He,
Xitong Yang,
Zuxuan Wu,
Hao Chen,
Ser-Nam Lim,
Abhinav Shrivastava
BMVC, 2021
arxiv
We introduce Global Temporal Attention (GTA), which performs global temporal attention on top of spatial attention in a decoupled manner. We apply GTA on both pixels and semantically similar regions to capture temporal relationships at different levels of spatial granularity.
Services
- Program Committee/Reviewers: CVPR, ICCV, ECCV, AAAI, NeurIPS, TPAMI
Thanks to Dr. Jon Barron for sharing the source code of his personal page.