Masking Modalities for Cross-modal Video Retrieval

Gabeur, Valentin; Nagrani, Arsha; Sun, Chen; Alahari, Karteek; Schmid, Cordelia

arXiv:2111.01300 (cs)

[Submitted on 1 Nov 2021 (v1), last revised 3 Nov 2021 (this version, v2)]

Abstract: Pre-training on large-scale unlabelled datasets has shown impressive performance improvements in the fields of computer vision and natural language processing. Given the advent of large-scale instructional video datasets, a common strategy for pre-training video encoders is to use the accompanying speech as weak supervision. However, as speech is used to supervise the pre-training, it is never seen by the video encoder, which does not learn to process that modality. We address this drawback of current pre-training methods, which fail to exploit the rich cues in spoken language. Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech. We mask an entire modality in the input and predict it using the other two modalities. This encourages each modality to collaborate with the others, and our video encoder learns to process appearance and audio as well as speech. We show the superior performance of our "modality masking" pre-training approach for video retrieval on the How2R, YouCook2 and Condensed Movies datasets.
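
The masking step described in the abstract (hide one entire modality, keep the other two as input, and use the hidden one as the prediction target) might be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the dictionary layout of the features, and the choice of a zero vector as the mask value are all assumptions made for the example, and the encoder and training loss are not shown.

```python
import random

# The three modalities named in the abstract.
MODALITIES = ("appearance", "audio", "speech")


def sample_masked_modality(rng):
    """Pick which modality to hide for this training example (hypothetical helper)."""
    return rng.choice(MODALITIES)


def mask_one_modality(features, masked, mask_value=0.0):
    """Hide every feature vector of one modality.

    `features` maps a modality name to a list of feature vectors.
    Returns (inputs, target): `inputs` keeps the other modalities unchanged
    and replaces the masked one with constant vectors (an assumption; a
    learned mask embedding would be another option), while `target` is a
    copy of the original features of the masked modality.
    """
    if masked not in features:
        raise KeyError(f"unknown modality: {masked}")
    inputs = {}
    for name, sequence in features.items():
        if name == masked:
            inputs[name] = [[mask_value] * len(vec) for vec in sequence]
        else:
            inputs[name] = [list(vec) for vec in sequence]
    target = [list(vec) for vec in features[masked]]
    return inputs, target


# Toy features: 2-dimensional vectors, different sequence length per modality.
features = {
    "appearance": [[0.1, 0.2], [0.3, 0.4]],
    "audio": [[0.5, 0.6]],
    "speech": [[0.7, 0.8], [0.9, 1.0], [1.1, 1.2]],
}

rng = random.Random(0)
masked = sample_masked_modality(rng)
inputs, target = mask_one_modality(features, masked)
print("masked modality:", masked)
```

In a full pre-training loop, a model would receive `inputs` and be trained so that its prediction for the hidden modality matches `target`; how that prediction is scored is left open here.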

Comments:	Accepted at WACV 2022
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2111.01300 [cs.CV]
	(or arXiv:2111.01300v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2111.01300

Submission history

From: Valentin Gabeur
[v1] Mon, 1 Nov 2021 23:55:04 UTC (4,938 KB)
[v2] Wed, 3 Nov 2021 12:36:48 UTC (4,938 KB)
