Valerii Likhosherstov, Anurag Arnab, Krzysztof Choromanski, Mario Lucic, Yi Tay, Adrian Weller, Mostafa Dehghani, "PolyViT: Co-training Vision Transformers on Images, Videos and Audio" arXiv 2021
https://arxiv.org/abs/2111.12993