MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval

Yuan, Huaying; Ni, Jian; Liu, Zheng; Wang, Yueze; Zhou, Junjie; Liang, Zhengyang; Zhao, Bo; Cao, Zhao; Dou, Zhicheng; Wen, Ji-Rong


arXiv:2502.12558 (cs)

[Submitted on 18 Feb 2025 (v1), last revised 20 May 2025 (this version, v4)]


Abstract: Accurately locating key moments within long videos is crucial for solving long video understanding (LVU) tasks. However, existing benchmarks are either severely limited in video length and task diversity, or they focus solely on end-to-end LVU performance, which makes them unsuitable for evaluating whether key moments can be accurately accessed. To address this challenge, we propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR), distinguished by the following features. First, it is built on long and diverse videos, averaging over 1200 seconds in duration and collected from various domains, e.g., movie, anomaly, egocentric, and sports. Second, it covers a variety of real-world scenarios at three levels (global, event, and object), spanning common tasks such as action recognition, object localization, and causal reasoning. Third, it incorporates rich forms of queries, including text-only, image-conditioned, and video-conditioned queries. On top of MomentSeeker, we conduct comprehensive experiments for both generation-based approaches (directly using MLLMs) and retrieval-based approaches (leveraging video retrievers). Our results reveal significant challenges in long-video moment retrieval in terms of both accuracy and efficiency, despite improvements from the latest long-video MLLMs and task-specific fine-tuning. We have publicly released MomentSeeker to facilitate future research in this area.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2502.12558 [cs.CV]
	(or arXiv:2502.12558v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2502.12558

Submission history

From: Huaying Yuan
[v1] Tue, 18 Feb 2025 05:50:23 UTC (11,161 KB)
[v2] Mon, 10 Mar 2025 05:34:20 UTC (5,908 KB)
[v3] Wed, 16 Apr 2025 03:11:44 UTC (5,925 KB)
[v4] Tue, 20 May 2025 03:30:44 UTC (20,736 KB)
