LARP: Tokenizing Videos with a Learned
Autoregressive Generative Prior

Hanyu Wang,  Saksham Suri,  Yixuan Ren,  Hao Chen,  Abhinav Shrivastava
University of Maryland, College Park
{hywang66, sakshams, yxren}@umd.edu   {chenh, abhinav}@cs.umd.edu
Project page: https://hywang66.github.io/larp/
Abstract

We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token on its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP’s strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).

Figure 1: LARP highlights. (a) LARP is a video tokenizer for two-stage video generative models. In the first stage, the LARP tokenizer is trained with a lightweight AR prior model to learn an AR-friendly latent space. In the second stage, an AR generative model is trained on LARP’s discrete tokens to synthesize high-fidelity videos. (b) The incorporation of the AR prior model significantly improves the generation FVD (gFVD) across various token number configurations. (c) LARP shows a much smaller gap between its reconstruction FVD (rFVD) and generation FVD (gFVD), indicating the effectiveness of the optimized latent space it has learned.

1 Introduction

The field of generative modeling has experienced significant advancements, largely driven by the success of autoregressive (AR) models in the development of large language models (LLMs) (Bai et al., 2023; Brown, 2020; Radford et al., 2019; Google et al., 2023; Touvron et al., 2023a; b). Building on AR transformers (Vaswani, 2017), these models are considered pivotal for the future of AI due to their exceptional performance (Hendrycks et al., 2020; 2021), impressive scalability (Henighan et al., 2020; Kaplan et al., 2020; Rae et al., 2021), and versatile flexibility (Radford et al., 2019; Brown, 2020).

Inspired by the success of LLMs, recent works have begun to employ AR transformers for visual generation (Van Den Oord et al., 2017; Razavi et al., 2019; Esser et al., 2021; Hong et al., 2022; Ge et al., 2022; Kondratyuk et al., 2023; Wang et al., 2024). Additionally, several recent developments have extended LLMs to handle multimodal inputs and outputs (Lu et al., 2022; Zheng et al., 2024), further demonstrating the promising potential of AR models in visual content generation. All of these methods employ a visual tokenizer to convert continuous visual signals into sequences of discrete tokens, allowing them to be autoregressively modeled in the same way as natural language is modeled by LLMs. Typically, a visual tokenizer consists of a visual encoder, a quantization module (Van Den Oord et al., 2017; Yu et al., 2023b), and a visual decoder. The generative modeling occurs in the quantized discrete latent space, with the decoder mapping the generated discrete token sequences back to continuous visual signals. It is evident that the visual tokenizer plays a pivotal role, as it directly influences the quality of the generated content. Building on this insight, several works have focused on improving the visual tokenizer (Lee et al., 2022; Yu et al., 2023b), making solid progress in enhancing the compression ratio and reconstruction fidelity of visual tokenization.

Most existing visual tokenizers follow a patchwise tokenization paradigm (Van Den Oord et al., 2017; Esser et al., 2021; Wang et al., 2024; Yu et al., 2023b), where the discrete tokens are quantized from the encoded patches of the original visual inputs. While these approaches are intuitive for visual data with spatial or spatialtemporal structures, they restrict the tokenizers’ ability to capture global and holistic representations of the entire input. This limitation becomes even more pronounced when applied to AR models, which rely on sequential processing and require locally encoded tokens to be transformed into linear 1D sequences. Previous research (Esser et al., 2021) has demonstrated that the method of flattening these patch tokens into a sequence is critical to the generation quality of AR models. Although most existing works adopt a raster scan order for this transformation due to its simplicity, it remains uncertain whether this is the optimal strategy. In addition, there are no clear guidelines for determining the most effective flattening order.

On the other hand, although the reconstruction fidelity of a visual tokenizer sets an upper bound on the generation fidelity of AR models, the factors that determine the gap between them remain unclear. In fact, higher reconstruction quality has been widely reported to sometimes lead to worse generation fidelity (Zhang et al., 2023; Yu et al., 2024). This discrepancy highlights the limitations of the commonly used reconstruction-focused design of visual tokenizers and underscores the importance of ensuring desirable properties in the latent space of the tokenizer. However, very few works have attempted to address this aspect in improving image tokenizers (Gu et al., 2024; Zhang et al., 2023), and for video tokenizers, it has been almost entirely overlooked.

In this paper, we present LARP, a video tokenizer with a Learned AutoRegressive generative Prior, designed to address the underexplored challenges identified in previous work. By leveraging a ViT-style spatialtemporal patchifier (Dosovitskiy, 2020) and a transformer encoder architecture (Vaswani, 2017), LARP forms an autoencoder and employs a stochastic vector quantizer (Van Den Oord et al., 2017) to tokenize videos into holistic token sequences. Unlike traditional patchwise tokenizers, which directly encode input patches into discrete tokens, LARP introduces a set of learned queries (Carion et al., 2020; Li et al., 2023) that are concatenated with the input patch sequences and then encoded into holistic discrete tokens. An illustrative comparison between the patchwise tokenizer and LARP is shown in Figure 2 (a) and the left part of Figure 2 (b). By decoupling the direct correspondence between discrete tokens and input patches, LARP allows for a flexible number of discrete tokens, enabling a trade-off between tokenization quality and latent representation length. This design also empowers LARP to produce more holistic and semantic representations of video content.

To further align LARP’s latent space with AR generative models, we incorporate a lightweight AR transformer as a prior model. It autoregressively models LARP’s latent space during training, providing signals to encourage learning a latent space that is well-suited for AR models. Importantly, the prior model is trained simultaneously with the main modules of LARP, but it is discarded during inference, adding zero memory or computational overhead to the tokenizer. Notably, by combining holistic tokenization with the co-training of the AR prior model, LARP automatically determines an order for latent discrete tokens in AR generation and trains the tokenizer to perform optimally within that structure. This approach eliminates the need to manually define a flattening order, which remains an unsolved challenge for traditional tokenizers.

To evaluate the effectiveness of the LARP tokenizer, we train a series of Llama-like (Touvron et al., 2023a; b; Sun et al., 2024) autoregressive (AR) generation models. Leveraging the holistic tokens and the learned AR generative prior, LARP achieves a Fréchet Video Distance (FVD) (Unterthiner et al., 2018) score of 57 on the UCF101 class-conditional video generation benchmark (Soomro, 2012), establishing a new state-of-the-art among all published video generative models, including proprietary and closed-source approaches like MAGVIT-v2 (Yu et al., 2023b). To summarize, our key contributions are listed as follows:

  • We present LARP, a novel video tokenizer that enables flexible, holistic tokenization, allowing for more semantic and global video representations.

  • LARP features a learned AR generative prior, achieved by co-training an AR prior model, which effectively aligns LARP’s latent space with the downstream AR generation task.

  • LARP significantly improves video generation quality for AR models across varying token sequence lengths, achieving state-of-the-art FVD performance on the UCF101 class-conditional video generation benchmark and outperforming all AR methods on the K600 frame prediction benchmark.

2 Related Work

2.1 Discrete Visual Tokenization

To enable AR models to generate high-resolution visual content, various discrete visual tokenization methods have been developed. The seminal work VQ-VAE (Van Den Oord et al., 2017; Razavi et al., 2019) introduces vector quantization to encode continuous images into discrete tokens, allowing them to be modeled by PixelCNN (Van den Oord et al., 2016). VQGAN (Esser et al., 2021) improves visual compression rate and perceptual reconstruction quality by incorporating GAN loss (Goodfellow et al., 2014) in training the autoencoder. Building on this, several works focus on improving tokenizer efficiency (Cao et al., 2023) and enhancing generation quality (Gu et al., 2024; Zheng et al., 2022; Zhang et al., 2023). Leveraging the powerful ViT (Dosovitskiy, 2020) architecture, ViT-VQGAN (Yu et al., 2021) improves VQGAN on image generation tasks.

Inspired by the success of image tokenization, researchers extend VQGAN to videos using 3D CNNs (Ge et al., 2022; Yan et al., 2021; Yu et al., 2023a). C-ViViT (Villegas et al., 2022) employs the temporal-causal ViT architecture to tokenize videos, while more recent work, MAGVIT-v2 (Yu et al., 2023b), introduces lookup-free quantization, significantly expanding the size of the quantization codebook. OmniTokenizer (Wang et al., 2024) unifies image and video tokenization using the same tokenizer model and weights for both tasks.

It is worth noting that all of the above tokenizers follow the patchwise tokenization paradigm discussed in Section 1, and are therefore constrained by patch-to-token correspondence. Very recently, a concurrent work (Yu et al., 2024) proposes a compact tokenization approach for images. However, it neither defines a flattening order for the discrete tokens nor introduces any prior or regularization to improve downstream generation performance.

2.2 Visual Generation

Visual generation has been a long-standing area of interest in machine learning and computer vision research. The first major breakthrough comes with the rise of Generative Adversarial Networks (GANs) (Goodfellow et al., 2014; Karras et al., 2019; 2020; Skorokhodov et al., 2022), known for their intuitive mechanism and fast inference capabilities. AR methods are also widely applied in visual generation. Early works (Van Den Oord et al., 2016; Van den Oord et al., 2016; Chen et al., 2020) model pixel sequences autoregressively, but are limited in their ability to synthesize high-resolution content due to the extreme length of pixel sequences. Recent advancements in visual tokenization make AR generative models for visual content more practical. While all tokenizers discussed in Section 2.1 are suitable for AR generation, many focus on BERT-style (Devlin, 2018) masked visual generation (Chang et al., 2022), such as in Yu et al. (2023a; b; 2024). Diffusion models (Ho et al., 2020; Song et al., 2020; Peebles & Xie, 2023) have recently emerged to dominate image (Dhariwal & Nichol, 2021) and video synthesis (Ho et al., 2022), delivering impressive visual generation quality. By utilizing VAEs (Kingma, 2013) to reduce resolution, latent diffusion models (Rombach et al., 2022; Blattmann et al., 2023) further scale up, enabling multimodal visual generation (Betker et al., 2023; Saharia et al., 2022; Podell et al., 2023; Brooks et al., 2024).

3 Method

Figure 2: Method overview. Cubes represent video patches, circles indicate continuous embeddings, and squares denote discrete tokens. (a) Patchwise video tokenizer used in previous works. (b) Left: The LARP tokenizer tokenizes videos in a holistic scheme, gathering information from the video using a set of learned queries. Right: The AR prior model, trained with LARP, predicts the next holistic token, enabling a latent space optimized for AR generation. The AR prior model is forwarded in two rounds per iteration. The red arrow represents the first round, and the purple arrows represent the second round. The reconstruction loss $\mathcal{L}_{\text{rec}}$ is omitted for simplicity.

3.1 Preliminary

Patchwise Video Tokenization. As discussed in Section 1, existing video tokenizers adopt a patchwise tokenization scheme, where latent tokens are encoded from the spatialtemporal patches of the input video. Typically, a patchwise video tokenizer consists of an encoder $\mathcal{E}$, a decoder $\mathcal{D}$, and a quantizer $\mathcal{Q}$. Given a video input $\bm{\mathsfit{V}} \in \mathbb{R}^{T \times H \times W \times 3}$, it is encoded, quantized, and reconstructed as:

$$\bm{\mathsfit{Z}} = \mathcal{E}(\bm{\mathsfit{V}}), \qquad \bm{\mathsfit{X}} = \mathcal{Q}(\bm{\mathsfit{Z}}), \qquad \hat{\bm{\mathsfit{V}}} = \mathcal{D}(\bm{\mathsfit{X}}), \tag{1}$$

where $\bm{\mathsfit{Z}} \in \mathbb{R}^{\frac{T}{f_T} \times \frac{H}{f_H} \times \frac{W}{f_W} \times d}$ refers to the spatialtemporally downsampled video feature maps with $d$ latent dimensions per location, $\bm{\mathsfit{X}} \in \mathbb{N}^{T' \times H' \times W'}$ denotes the quantized discrete tokens, and $\hat{\bm{\mathsfit{V}}}$ is the reconstructed video. Here, $f_T, f_H, f_W$ are the downsampling factors for the spatialtemporal dimensions $T, H, W$, respectively.

Despite different implementations of the encoder $\mathcal{E}$, decoder $\mathcal{D}$, and quantizer $\mathcal{Q}$, all patchwise tokenizers maintain a fixed downsampling factor for each spatialtemporal dimension. The latent vector $\bm{\mathsfit{Z}}_{i,j,k,:} \in \mathbb{R}^{d}$ at each position is typically the direct output of its spatialtemporally corresponding input video patch (e.g., same spatialtemporal location in CNNs, or token position in transformers). While this design is intuitive for 3D signals like video, it limits the discrete tokens to low-level patch features, hindering their ability to capture higher-level, holistic information. Moreover, this formulation introduces the challenge of flattening patch tokens into a unidirectional sequence, which is critical for AR generation.

Autoregressive Modeling. Given a sequence of discrete tokens $\bm{x} = (x_1, x_2, \dots, x_n)$, we can train a neural network to model the probability distribution $p_\theta(\bm{x})$ autoregressively as follows:

$$p_\theta(\bm{x}) = \prod_{i=1}^{n} p_\theta\left(x_i \mid x_1, \dots, x_{i-1}, \theta\right), \tag{2}$$

where $\theta$ denotes the neural network parameters. This model can be conveniently trained by minimizing the negative log-likelihood (NLL) of $p_\theta(\bm{x})$. During inference, it iteratively predicts the next token $x_i$ by sampling from $p_\theta\left(x_i \mid x_1, \dots, x_{i-1}, \theta\right)$, based on the previously generated tokens.

While autoregressive modeling imposes no direct constraints on data modality, it does require the data to be both discrete and sequential, which necessitates the use of a visual tokenizer when applied to images or videos.
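As a minimal sketch of Equation 2, the following Python snippet evaluates the NLL of a token sequence under the chain-rule factorization and then samples a sequence one token at a time. The function `toy_conditional` is a hypothetical stand-in for the neural network $p_\theta$ (it simply favors repeating the previous token) and is not part of LARP; the vocabulary size and sequence length are likewise illustrative.

```python
import math
import random

def toy_conditional(prefix, vocab_size):
    """Hypothetical stand-in for p_theta(x_i | x_1, ..., x_{i-1}).

    Returns a probability vector over the vocabulary that favors repeating
    the previous token. A real model would be a neural network.
    """
    weights = [1.0] * vocab_size
    if prefix:
        weights[prefix[-1]] += 2.0
    total = sum(weights)
    return [w / total for w in weights]

def sequence_nll(tokens, vocab_size):
    """Negative log-likelihood of `tokens` under the chain rule (Equation 2)."""
    nll = 0.0
    for i, token in enumerate(tokens):
        probs = toy_conditional(tokens[:i], vocab_size)
        nll -= math.log(probs[token])
    return nll

def sample_sequence(length, vocab_size, rng):
    """Generate tokens one at a time, conditioning on those already sampled."""
    tokens = []
    for _ in range(length):
        probs = toy_conditional(tokens, vocab_size)
        next_token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens

rng = random.Random(0)
sampled = sample_sequence(length=8, vocab_size=4, rng=rng)
print(sampled, sequence_nll(sampled, vocab_size=4))
```

Training corresponds to lowering `sequence_nll` on real data, while generation corresponds to `sample_sequence`; both rely on the tokens having a fixed left-to-right order.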

3.2 Holistic Video Tokenization

Patchify. LARP employs the transformer architecture (Vaswani, 2017) due to its exceptional performance and scalability. Following the ViT framework (Dosovitskiy, 2020), we split the input video into spatialtemporal patches, and linearly encode each patch into continuous transformer patch embeddings. Formally, given a video input $\bm{\mathsfit{V}} \in \mathbb{R}^{T \times H \times W \times 3}$, the video is linearly patchified as follows:

$$\bm{\mathsfit{P}} = \mathcal{P}(\bm{\mathsfit{V}}), \quad \bm{E} = \operatorname{flatten}(\bm{\mathsfit{P}}), \tag{3}$$

where $\mathcal{P}$ denotes the linear patchify operation, $\bm{\mathsfit{P}} \in \mathbb{R}^{\frac{T}{f_T} \times \frac{H}{f_H} \times \frac{W}{f_W} \times d}$ is the spatialtemporal patches projected onto $d$ dimensions, and $\bm{E} \in \mathbb{R}^{m \times d}$ is the flattened $d$-dimensional patch embeddings. Here, $f_T, f_H, f_W$ are the downsampling factors for dimensions $T, H, W$, respectively, and $m = \frac{T}{f_T} \times \frac{H}{f_H} \times \frac{W}{f_W}$ is the total number of patch tokens. Importantly, the patch embeddings $\bm{E}$ remain local in nature, and therefore cannot be directly used to generate holistic discrete tokens.
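To make the bookkeeping in Equation 3 concrete, the sketch below computes the patch grid and the token count $m$ from a video shape and downsampling factors, and maps a patch position to its row in the flattened sequence. The video shape `(16, 128, 128)`, the factors `(4, 8, 8)`, and the row-major flattening are assumptions chosen for illustration; the text above does not fix any of them.

```python
def patch_grid(video_shape, factors):
    """Return the patch grid (T/f_T, H/f_H, W/f_W) and the token count m."""
    T, H, W = video_shape
    f_T, f_H, f_W = factors
    if T % f_T or H % f_H or W % f_W:
        raise ValueError("each dimension must be divisible by its factor")
    grid = (T // f_T, H // f_H, W // f_W)
    m = grid[0] * grid[1] * grid[2]
    return grid, m

def flat_index(t, h, w, grid):
    """Row-major position of patch (t, h, w) in the flattened sequence E.

    Row-major order is an assumption of this sketch.
    """
    _, grid_h, grid_w = grid
    return (t * grid_h + h) * grid_w + w

grid, m = patch_grid((16, 128, 128), (4, 8, 8))
print(grid, m)                    # (4, 16, 16) 1024
print(flat_index(1, 0, 0, grid))  # 256
```

Each row of $\bm{E}$ still corresponds to exactly one location in this grid, which is the locality that the holistic queries are meant to remove.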

Query-based Transformer. To design a holistic video tokenizer, it is crucial to avoid directly encoding individual patches into discrete tokens. To achieve this, we adapt the philosophy of Carion et al. (2020); Li et al. (2023) to learn a set of fixed input queries to capture the holistic information from the video, as illustrated in the left section of Figure 2 (b). For simplicity, LARP employs a transformer encoder architecture, as opposed to the transformer encoder-decoder structure used in Carion et al. (2020). Here and throughout this paper, “transformer encoder” refers to the specific parallel transformer encoder architecture defined in Dosovitskiy (2020). In-context conditioning is applied to enable information mixing between different patch and query tokens.

Formally, we define $n$ learnable holistic query embeddings $\bm{Q}_L \in \mathbb{R}^{n \times d}$, where each embedding is $d$-dimensional. These query embeddings are concatenated with the patch embeddings $\bm{E}$ along the token dimension. The resulting sequence, now of length $(n + m)$, is then input to the LARP encoder $\mathcal{E}$ and quantizer $\mathcal{Q}$ as follows:

$$\bm{Z} = \mathcal{E}(\bm{Q}_L \parallel \bm{E}), \quad \bm{x} = \mathcal{Q}(\bm{Z}_{1:n,:}), \tag{4}$$

where $\parallel$ denotes the concatenation operation, $\bm{Z}$ is the latent embeddings, and $\bm{x} = (x_1, \dots, x_n)$ denotes the quantized discrete tokens. Note that only $\bm{Z}_{1:n,:}$, i.e., the latent embeddings corresponding to the query embeddings, are quantized and used. This ensures that each discrete token $x_i$ has an equal chance to represent any video patch, eliminating both soft and hard local patch constraints.
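The encoding half of Equation 4 can be sketched as follows. The `self_attention` function is a hypothetical, weight-free stand-in for the LARP encoder $\mathcal{E}$ (a single attention step with no learned projections), and the sizes `n`, `m`, `d` are illustrative; the point is only that the query rows are concatenated in front of the patch rows and that the first $n$ outputs are kept.

```python
import math
import random

def self_attention(tokens):
    """Toy single-head self-attention without learned projections.

    Hypothetical stand-in for the LARP encoder: every output row is a
    softmax-weighted average over all input rows, so information mixes
    freely between query and patch tokens.
    """
    dim = len(tokens[0])
    outputs = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, tokens)) / norm
            for j in range(dim)
        ])
    return outputs

def encode_holistic(queries, patches):
    """Equation 4 up to quantization: encode Q_L || E and keep rows 1..n."""
    n = len(queries)
    latents = self_attention(queries + patches)  # sequence length n + m
    return latents[:n]

rng = random.Random(0)
n, m, d = 4, 12, 8
queries = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
patches = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
holistic = encode_holistic(queries, patches)
print(len(holistic), len(holistic[0]))  # 4 8
```

Because the number of kept rows is set by the number of queries rather than by the patch grid, changing `n` changes the token count without touching the patchifier.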

The LARP decoder is also implemented as a transformer encoder neural network. During the decoding stage, LARP follows a similar approach, utilizing $m$ learnable patch query embeddings $\bm{Q}_P \in \mathbb{R}^{m \times d}$. The decoding process is defined as:

$$\hat{\bm{Z}} = \mathcal{Q}^{-1}(\bm{x}), \quad \hat{\bm{\mathsfit{V}}} = \operatorname{reshape}(\mathcal{D}(\bm{Q}_P \parallel \hat{\bm{Z}})_{1:m,:}), \tag{5}$$

where $\mathcal{Q}^{-1}$ denotes the de-quantization operation that maps discrete tokens $\bm{x}$ back to the continuous latent embeddings $\hat{\bm{Z}} \in \mathbb{R}^{n \times d}$. These embeddings are concatenated with the patch query embeddings $\bm{Q}_P$, and the combined sequence of length $m + n$ is decoded into a sequence of continuous vectors. The first $m$ vectors are reshaped to reconstruct the video $\hat{\bm{\mathsfit{V}}} \in \mathbb{R}^{T \times H \times W \times 3}$.

Crucially, although the latent tokens $\bm{x}$ are now both holistic and discrete, no specific flattening order is imposed due to the unordered nature of the holistic query set and the parallel processing property of the transformer encoder. As a result, $\bm{x}$ is not immediately suitable for AR modeling.

Stochastic Vector Quantization. While vector quantization (VQ) (Van Den Oord et al., 2017) has been widely adopted in previous visual quantizers (Esser et al., 2021; Ge et al., 2022), its deterministic nature limits the tokenizer’s ability to explore inter-code correlations, resulting in less semantically rich codes. To address these limitations, LARP employs a stochastic vector quantization (SVQ) paradigm to implement the quantizer $\mathcal{Q}$. Similar to VQ, SVQ maintains a codebook $\bm{C} \in \mathbb{R}^{c \times d'}$, which stores $c$ vectors, each of dimension $d'$. The optimization objective $\mathcal{L}_{\text{SVQ}}$ includes a weighted sum of the commitment loss and the codebook loss, as defined in Van Den Oord et al. (2017). The key difference lies in the look-up operation. While VQ uses an $\operatorname*{arg\,min}$ operation to find the closest code by minimizing the distance between the input vector $\bm{v} \in \mathbb{R}^{d'}$ and all codes in $\bm{C}$, SVQ introduces stochasticity in this process. Specifically, SVQ computes the cosine similarities $\bm{s}$ between the input vector $\bm{v}$ and all code vectors in $\bm{C}$, interprets these similarities as logits, and applies a softmax normalization to obtain the probabilities $\bm{p}$. One index $x$ is then sampled from the resulting multinomial distribution $P(x)$.
Formally, the SVQ process $x = \mathcal{Q}(\bm{v})$ is defined as:

$$\bm{s}_i = \frac{\bm{v} \cdot \bm{C}_i}{\|\bm{v}\| \|\bm{C}_i\|}, \quad \bm{p} = \mathrm{softmax}(\bm{s}), \tag{6}$$
$$x \sim P(x) = \prod_{j=1}^{c} \bm{p}_j^{\bm{1}_{x=j}}, \tag{7}$$

where $\bm{1}$ denotes the indicator function. To maintain the differentiability of SVQ, we apply the straight-through estimator (Bengio et al., 2013). The de-quantization operation is performed via a straightforward index look-up, $\hat{\bm{v}} = \mathcal{Q}^{-1}(x) = \bm{C}_x$, similar to the standard VQ process.
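The look-up step of SVQ (Equations 6 and 7) can be sketched in plain Python as below, next to a deterministic nearest-code VQ look-up for comparison. This is a sketch under stated assumptions rather than the authors' implementation: the three-entry codebook and the input vector are made up, the Euclidean distance in `vq_quantize` is one common choice for VQ, and the straight-through estimator and the SVQ loss are omitted.

```python
import math
import random

def cosine_similarities(v, codebook):
    """Equation 6, left: cosine similarity between v and every code."""
    v_norm = math.sqrt(sum(x * x for x in v))
    sims = []
    for code in codebook:
        c_norm = math.sqrt(sum(x * x for x in code))
        dot = sum(a * b for a, b in zip(v, code))
        sims.append(dot / (v_norm * c_norm))
    return sims

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def svq_quantize(v, codebook, rng):
    """Sample a code index from softmax(cosine similarities) (Equation 7)."""
    probs = softmax(cosine_similarities(v, codebook))
    index = rng.choices(range(len(codebook)), weights=probs, k=1)[0]
    return index, probs

def vq_quantize(v, codebook):
    """Deterministic baseline: index of the nearest code (Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(v, code)) for code in codebook]
    return min(range(len(codebook)), key=dists.__getitem__)

def dequantize(index, codebook):
    """De-quantization is a plain index look-up into the codebook."""
    return codebook[index]

codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # illustrative codebook, c = 3
v = [2.0, 0.1]
rng = random.Random(0)
index, probs = svq_quantize(v, codebook, rng)
print(index, [round(p, 3) for p in probs])
print(vq_quantize(v, codebook), dequantize(index, codebook))
```

Repeated calls to `svq_quantize` with the same input can return different indices, whereas `vq_quantize` always returns the same one; the most similar code is still the most likely sample.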

Reconstructive Training. Following Esser et al. (2021); Ge et al. (2022); Yu et al. (2023a), the reconstructive training loss of LARP, $\mathcal{L}_{\text{rec}}$, is composed of $L_1$ reconstruction loss, LPIPS perceptual loss (Zhang et al., 2018), GAN loss (Goodfellow et al., 2014), and SVQ loss $\mathcal{L}_{\text{SVQ}}$.

3.3 Learning an Autoregressive Generative Prior

Continuous Autoregressive Transformer. To better align LARP’s latent space with AR generative models, we introduce a lightweight AR transformer as a prior model, which provides gradients to push the latent space toward a structure optimized for AR generation. A key challenge in designing the prior model lies in the discrete nature of the latent space. Simply applying an AR model to the discrete token sequence would prevent gradients from being back-propagated to the LARP encoder. Furthermore, unlike the stable discrete latent spaces of fully trained tokenizers, LARP’s latent space is continuously evolving during training, which can destabilize AR modeling and reduce the quality of the signals it provides to the encoder. To address these issues, we modify a standard AR transformer into a continuous AR transformer by redefining its input and output layers, as depicted in the right section of Figure 2 (b).

The input layer of a standard AR transformer is typically an embedding look-up layer. In the prior model of LARP, this is replaced with a linear projection that takes the de-quantized latents $\hat{\bm{Z}}$ as input, ensuring proper gradient flow during training. The output layer of a standard AR transformer predicts the logits of the next token. While this does not block gradient propagation, it lacks awareness of the vector values in the codebook, making it unsuitable for the continuously evolving latent space during training. In contrast, the output layer of LARP’s AR prior model makes predictions following the SVQ scheme described in Section 3.2. It predicts an estimate of the next token’s embedding, $\bar{\bm{v}} \in \mathbb{R}^{d'}$, which has the same shape as a codebook vector $\bm{C}_{i}$. Similar to SVQ, the predicted embedding $\bar{\bm{v}}$ is used to compute cosine similarities with all code vectors in $\bm{C}$, as described in Equation 6. These similarities are then $\mathrm{softmax}$-normalized and interpreted as probabilities, which are used to compute the negative log-likelihood (NLL) loss with the input tokens as the ground truth. To predict the next token, a sample is drawn from the resulting multinomial distribution using Equation 7. This output layer design ensures that the AR prior model remains aware of the continuously evolving codebook, enabling it to make more accurate predictions and provide more precise signals to effectively train the LARP tokenizer.
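The output layer described above can be illustrated with a minimal sketch in plain Python (a real implementation would use a tensor library with automatic differentiation). The function names, the toy 2-dimensional codebook, and the reuse of the SVQ temperature of 0.03 for the prior are assumptions made for illustration only.

```python
import math
import random

def cosine_similarity(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def code_probabilities(v_bar, codebook, temperature=0.03):
    """Softmax over temperature-scaled cosine similarities to every code vector."""
    logits = [cosine_similarity(v_bar, c) / temperature for c in codebook]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - max_logit) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(v_bar, codebook, target_index, temperature=0.03):
    """Negative log-likelihood of the ground-truth token index."""
    probs = code_probabilities(v_bar, codebook, temperature)
    return -math.log(max(probs[target_index], 1e-12))

def sample_next_token(v_bar, codebook, rng, temperature=0.03):
    """Draw the next token index from the resulting multinomial distribution."""
    probs = code_probabilities(v_bar, codebook, temperature)
    return rng.choices(range(len(codebook)), weights=probs, k=1)[0]

# Toy example (hypothetical values): three code vectors, one predicted embedding.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v_bar = [0.9, 0.1]
probs = code_probabilities(v_bar, codebook)
loss = nll_loss(v_bar, codebook, target_index=0)
token = sample_next_token(v_bar, codebook, random.Random(0))
```

Because the probabilities are computed from the current code vectors rather than from a separate classification head, any update to the codebook immediately changes the prior's predictive distribution.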

Scheduled Sampling. Exposure bias (Ranzato et al., 2015) is a well-known challenge in AR modeling. During training, the model is fed the ground-truth data to predict the next token. However, during inference, the model must rely on its own previous predictions, which may contain errors, creating a mismatch between training and inference conditions. While the AR prior model in LARP is only used during training, it encounters a similar issue: as the codebook evolves, the semantic meaning of discrete tokens can shift, making the input sequence misaligned with the prior model’s learned representations. To address this problem, we employ the scheduled sampling technique (Bengio et al., 2015; Mihaylova & Martins, 2019) within the AR prior model of LARP. Specifically, after the first forward pass of the prior model, we randomly mix the predicted output sequence with the original input sequence at the token level. This mixed sequence is then fed into the AR prior model for a second forward pass. The NLL loss is computed for both rounds of predictions and averaged, helping to reduce exposure bias and ensure more robust training.
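The token-level mixing step can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes the first-pass predictions have already been aligned position-by-position with the input tokens, and the function names are hypothetical.

```python
import random

def mix_tokens(input_tokens, predicted_tokens, mixing_rate, rng):
    """Replace each input token by its first-pass prediction with probability mixing_rate."""
    if len(input_tokens) != len(predicted_tokens):
        raise ValueError("sequences must have the same length")
    return [
        pred if rng.random() < mixing_rate else gt
        for gt, pred in zip(input_tokens, predicted_tokens)
    ]

def scheduled_sampling_loss(nll_first_pass, nll_second_pass):
    """Average the NLL losses of the two forward passes."""
    return 0.5 * (nll_first_pass + nll_second_pass)

# Toy example with hypothetical token indices.
rng = random.Random(0)
input_tokens = [5, 1, 7, 3]
predicted_tokens = [5, 2, 7, 9]
mixed = mix_tokens(input_tokens, predicted_tokens, mixing_rate=0.5, rng=rng)
```

The mixed sequence is what the prior model consumes in its second forward pass; a mixing rate of 0 recovers ordinary teacher forcing.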

Integration. Although the AR prior model functions as a standalone module, it is trained jointly with the LARP tokenizer in an end-to-end manner. Once the NLL loss $\mathcal{L}_{\text{prior}}$ is computed, it is combined with the reconstructive loss $\mathcal{L}_{\text{rec}}$ to optimize the parameters of both the prior model and the tokenizer. Formally, the total loss is defined as:

$$\mathcal{L} = \mathcal{L}_{\text{rec}} + \alpha\,\mathcal{L}_{\text{prior}}, \tag{8}$$

where $\alpha$ is the loss weight, and $\mathcal{L}_{\text{rec}}$ is defined in Section 3.2. Since $\alpha$ is typically set to a small value, we apply a higher learning rate to the parameters of the prior model to ensure effective learning. Importantly, the prior model is used solely to encourage an AR-friendly discrete latent space for LARP during training. It is discarded at inference time, meaning it has no effect on the inference speed or memory footprint.
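A minimal sketch of the combined objective and the two learning rates is given below. The default values ($\alpha = 0.06$, multiplier 50, base learning rate $10^{-4}$) are taken from Section 4.1 and Appendix A.1; the function names, and the assumption that the multiplier is applied directly to the tokenizer's base learning rate, are illustrative.

```python
def total_loss(loss_rec, loss_prior, alpha=0.06):
    """Equation 8: reconstructive loss plus the weighted AR prior NLL loss."""
    return loss_rec + alpha * loss_prior

def learning_rates(base_lr=1e-4, prior_lr_multiplier=50.0):
    """Per-module learning rates; the prior model uses a scaled-up rate (assumed form)."""
    return {"tokenizer": base_lr, "prior": base_lr * prior_lr_multiplier}

# Toy example with hypothetical loss values.
example_total = total_loss(loss_rec=1.0, loss_prior=5.0)
example_lrs = learning_rates()
```

In a typical optimizer setup these two rates would correspond to two parameter groups, one for the tokenizer and one for the prior model.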

4 Experiments

Figure 3: Scaling LARP tokenizer size and number of tokens. (a) Scaling LARP tokenizer sizes. (b) Scaling number of discrete tokens.

4.1 Setup

Dataset. We conduct video reconstruction and generation experiments using the Kinetics-600 (K600) (Carreira et al., 2018) and UCF-101 (Soomro, 2012) datasets. In all experiments, we use 16-frame video clips with a spatial resolution of $128 \times 128$ for both training and evaluation, following Ge et al. (2022); Yu et al. (2023a; b).

Implementation Details. LARP first patchifies the input video. In all experiments, the patch sizes are set to $f_{T} = 4$, $f_{H} = 8$, and $f_{W} = 8$, respectively. As a result, a $16 \times 128 \times 128$ video clip is split into $4 \times 16 \times 16 = 1024$ video patches, which are projected into 1024 continuous patch embeddings in the first layer of LARP. For the SVQ quantizer, we utilize a factorized codebook with a size of 8192 and a dimension of $d' = 8$, following the recommendations of Yu et al. (2021). The softmax normalization in Equation 6 is applied with a temperature of 0.03. The AR prior model in LARP is adapted from a small GPT-2 model (Radford et al., 2019), consisting of only 21.7M parameters. Scheduled sampling for the AR prior model employs a linear warm-up for the mixing rate, starting from 0 and reaching a peak of 0.5 at $30\%$ of the total training steps. We set the AR prior loss weight $\alpha = 0.06$ in our main experiments, and use a learning rate multiplier of 50.
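The patch count and the mixing-rate warm-up above can be checked with a short sketch. The function names are hypothetical, and holding the mixing rate constant at its peak after the warm-up is an assumption, since the text only specifies the warm-up phase.

```python
def num_patches(frames, height, width, f_t=4, f_h=8, f_w=8):
    """Number of video patches for the given clip size and patch sizes."""
    if frames % f_t or height % f_h or width % f_w:
        raise ValueError("video dimensions must be divisible by the patch sizes")
    return (frames // f_t) * (height // f_h) * (width // f_w)

def mixing_rate(step, total_steps, peak=0.5, warmup_fraction=0.3):
    """Linear warm-up from 0 to `peak` over the first `warmup_fraction` of training.

    Keeping the rate at `peak` afterwards is an assumption of this sketch.
    """
    warmup_steps = warmup_fraction * total_steps
    if step >= warmup_steps:
        return peak
    return peak * step / warmup_steps

patches = num_patches(16, 128, 128)      # 4 * 16 * 16
halfway = mixing_rate(150, 1000)         # halfway through the warm-up
```

For a 16-frame clip at $128 \times 128$ resolution this gives 1024 patches, matching the count stated above.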

We employ a Llama-like transformer (Touvron et al., 2023a; b; Sun et al., 2024) as our AR generative model. A class token [cls] is used in the class-conditional generation task on UCF-101, and a separator token [sep] is used in the frame prediction task on K600.

Fréchet Video Distance (FVD) (Unterthiner et al., 2018) serves as the main evaluation metric for both reconstruction and generation experiments.
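For reference, FVD is a Fréchet distance between Gaussian fits to the features of real and generated (or reconstructed) videos, extracted by a pretrained video network as described by Unterthiner et al. (2018). The formula below is the standard Fréchet distance between two Gaussians; the subscripts $r$ and $g$ (real and generated) are notation introduced here for illustration.

```latex
d^{2}\big((\bm{\mu}_{r}, \bm{\Sigma}_{r}), (\bm{\mu}_{g}, \bm{\Sigma}_{g})\big)
  = \lVert \bm{\mu}_{r} - \bm{\mu}_{g} \rVert_{2}^{2}
  + \mathrm{Tr}\!\left( \bm{\Sigma}_{r} + \bm{\Sigma}_{g}
  - 2 \left( \bm{\Sigma}_{r} \bm{\Sigma}_{g} \right)^{1/2} \right)
```

Lower values indicate that the two feature distributions are closer, which is why both rFVD and gFVD are reported with a downward arrow.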

| Method | #Params (Tokenizer) | #Params (Generator) | #Tokens | rFVD↓ | gFVD↓ (K600) | gFVD↓ (UCF) |
|---|---|---|---|---|---|---|
| Diffusion-based generative models with continuous video tokenizers | | | | | | |
| VideoFusion (Luo et al., 2023) | - | 2B | - | - | - | 173 |
| HPDM (Skorokhodov et al., 2024) | - | 725M | - | - | - | 66 |
| MLM generative models with discrete video tokenizers | | | | | | |
| MAGVIT-MLM (Yu et al., 2023a) | 158M | 306M | 1024 | 25 | 9.9 | 76 |
| MAGVIT-v2-MLM (Yu et al., 2023b) | - | 307M | 1280 | 8.6 | 4.3 | 58 |
| AR generative models with discrete video tokenizers | | | | | | |
| CogVideo (Hong et al., 2022) | - | 9.4B | 2065 | - | 109.2 | 626 |
| TATS (Ge et al., 2022) | 32M | 321M | 1024 | 162 | - | 332 |
| MAGVIT-AR (Yu et al., 2023a) | 158M | 306M | 1024 | 25 | - | 265 |
| MAGVIT-v2-AR (Yu et al., 2023b) | - | 840M | 1280 | 8.6 | - | 109 |
| OmniTokenizer (Wang et al., 2024) | 82.2M | 650M | 1280 | 42 | 32.9 | 191 |
| LARP-L (Ours) | 173M | 343M | 1024 | 24 | 6.2 | 107 |
| LARP-L-Long (Ours) | 173M | 343M | 1024 | 20 | 6.2 | 102 |
| LARP-L-Long (Ours) | 173M | 632M | 1024 | 20 | 5.1 | 57 |

Table 1: Comparison of video generation results. Results are grouped by the type of generative models. The scores for MAGVIT-AR and MAGVIT-v2-AR are taken from the appendix of MAGVIT-v2 (Yu et al., 2023b). LARP-L-Long denotes the LARP-L trained for more epochs. Our best results are obtained with a larger AR generator.

4.2 Scaling

To explore the effect of scaling the LARP tokenizer, we begin by varying its size while keeping the number of latent tokens fixed at 1024. As shown in Figure 3 (a), we compare the reconstruction FVD (rFVD) and generation FVD (gFVD) for three scaled versions of LARP: LARP-L, LARP-B, and LARP-S, with parameter counts of 173.0M, 116.3M, and 39.8M, respectively. All results are reported on the UCF-101 dataset. Interestingly, while rFVD consistently improves as the tokenizer size increases, gFVD saturates when scaling from LARP-B to LARP-L, suggesting that gFVD can follow a different trend from rFVD. Notably, as shown in Figure 1 (c), LARP has already achieved the smallest gap between rFVD and gFVD, further demonstrating the effectiveness of the optimized latent space it has learned.

One of LARP’s key features is its holistic video tokenization, which supports an arbitrary number of latent discrete tokens. Intuitively, using more tokens slows down the AR generation process but improves reconstruction quality. Conversely, using fewer tokens significantly speeds up the process but may lead to lower reconstruction quality due to the smaller information bottleneck. To evaluate this trade-off, we use LARP-B and the default AR model, scaling down the number of latent tokens from 1024 to 512 and 256. The corresponding rFVD and gFVD results on the UCF-101 dataset are reported in Figure 3 (b). As expected, both rFVD and gFVD increase (i.e., worsen) when fewer tokens are used to represent a video. However, when reducing from 512 to 256 tokens, gFVD degrades more slowly than rFVD, indicating improved generative representation efficiency.

4.3 Video Generation Comparison

For video generation, we compare LARP with other state-of-the-art published video generative models, including diffusion-based models, Masked Language Modeling (MLM) methods, and AR methods. We use the UCF-101 class-conditional generation benchmark and the K600 frame prediction benchmark, where the first 5 frames are provided to predict the next 11 frames in a 16-frame video clip. As shown in Table 1, LARP outperforms all other video generators on the UCF-101 dataset, setting a new state-of-the-art FVD of 57. Notably, within the family of AR generative models, LARP surpasses all other AR methods by a large margin on both the UCF-101 and K600 datasets, including the closed-source MAGVIT-v2-AR (Yu et al., 2023b). Moreover, the last two rows of Table 1 demonstrate that using a larger AR generator can significantly improve LARP’s generation quality, highlighting the scalability of LARP’s representation.

4.4 Visualization

Video Reconstruction. In Figure 4, we compare video reconstruction quality of LARP with OmniTokenizer (Wang et al., 2024). LARP consistently outperforms OmniTokenizer, particularly in complex scenes and regions, further validating the rFVD comparison results shown in Table 1.

Figure 4: Video reconstruction comparison with OmniTokenizer (Wang et al., 2024).

Class-Conditional Video Generation. We present class-conditional video generation results in Figure 5. LARP constructs a discrete latent space that is better suited for AR generation, which enables the synthesis of high-fidelity videos, not only improving the quality of individual frames but also enhancing overall temporal consistency. Additional results are provided in the appendix.

Figure 5: Class-conditional video generation results on the UCF-101 dataset using LARP.

Video Frame Prediction. Video frame prediction results are displayed in Figure 6. The vertical yellow line marks the boundary between the conditioned frames and the predicted frames. We use 5 frames as input to predict the following 11 frames, forming a 16-frame video clip, which is temporally downsampled to 8 frames for display. It is evident that LARP effectively predicts frames with diverse scenes and natural motions. Additional results are provided in the appendix.

Figure 6: Video frame prediction results on the K600 dataset using LARP.

| Configuration | PSNR↑ | LPIPS↓ | rFVD↓ | gFVD↓ |
|---|---|---|---|---|
| LARP-B | 27.88 | 0.0855 | 31 | 107 |
| No AR prior model | 27.95 | 0.0830 | 23 | 190 |
| No scheduled sampling in AR prior model | 27.85 | 0.0856 | 27 | 142 |
| Deterministic quantization | 27.65 | 0.0884 | 27 | 149 |
| Small AR prior model loss weight ($\alpha = 0.03$) | 27.83 | 0.0866 | 28 | 120 |
| No CFG | 27.88 | 0.0855 | 31 | 121 |

Table 2: Ablation study. All configurations are modified from the LARP-B model.

4.5 Ablation Study

To assess the impact of the different components proposed in Section 3, we perform an ablation study, with results shown in Table 2. The AR prior model contributes the most to LARP’s performance: removing it raises gFVD from 107 to 190. As further validated in Figure 1 (b), the improvement from using the AR prior model remains consistent across different token numbers. The scheduled sampling for the AR prior model and the use of SVQ are also critical, as both are closely tied to the AR prior model’s effectiveness. The loss weight of the AR prior model and the use of CFG have relatively minor effects on the generative performance. Interestingly, the model without the AR prior achieves the best reconstruction results but the worst generation results, highlighting the effectiveness of the AR prior model in enhancing LARP’s discrete latent space for generative tasks.
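The relative importance of each component can be read off Table 2 by computing the gFVD increase of every ablation over the LARP-B baseline. The short script below does this arithmetic using only the gFVD values reported in Table 2; the variable names are illustrative.

```python
# gFVD values copied from Table 2 (lower is better).
baseline_gfvd = 107  # LARP-B
ablation_gfvd = {
    "No AR prior model": 190,
    "No scheduled sampling in AR prior model": 142,
    "Deterministic quantization": 149,
    "Small AR prior model loss weight (alpha=0.03)": 120,
    "No CFG": 121,
}

# Increase in gFVD caused by each ablation relative to the baseline.
gfvd_increase = {name: value - baseline_gfvd for name, value in ablation_gfvd.items()}

# Rank ablations from most to least harmful for generation quality.
ranked = sorted(gfvd_increase.items(), key=lambda item: item[1], reverse=True)
for name, delta in ranked:
    print(f"{name}: +{delta} gFVD")
```

Removing the AR prior model costs 83 gFVD points, roughly twice the cost of the next most harmful ablation (deterministic quantization, 42 points).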

5 Conclusion and Future Work

In this paper, we introduce LARP, a novel video tokenizer tailored specifically for autoregressive (AR) generative models. By introducing a holistic tokenization scheme with learned queries, LARP captures more global and semantic video representations, offering greater flexibility in the number of discrete tokens. The integration of a lightweight AR prior model during training optimizes the latent space for AR generation and defines an optimal token order, significantly improving performance in AR tasks. Extensive experiments on video reconstruction, class-conditional video generation, and video frame prediction demonstrate LARP’s ability to achieve state-of-the-art FVD scores. The promising results of LARP not only highlight its efficacy in video generation tasks but also suggest its potential for broader applications, including the development of multimodal large language models (MLLMs) to handle video generation and understanding in a unified framework.

Acknowledgments

This work was partially supported by NSF CAREER Award (#2238769) and an Amazon Research Award (Fall 2023) to AS. The authors acknowledge UMD’s supercomputing resources made available for conducting this research. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of NSF, Amazon, or the U.S. Government.

References

  • Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. Advances in neural information processing systems, 28, 2015.
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Betker et al. (2023) James Betker, Gabriel Goh, Li Jing, Tim Brooks, Jianfeng Wang, Linjie Li, Long Ouyang, Juntang Zhuang, Joyce Lee, Yufei Guo, et al. Improving image generation with better captions. Computer Science. https://cdn.openai.com/papers/dall-e-3.pdf, 2(3):8, 2023.
  • Blattmann et al. (2023) Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22563–22575, 2023.
  • Brooks et al. (2024) Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. Technical Report, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
  • Brown (2020) Tom B Brown. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
  • Cao et al. (2023) Shiyue Cao, Yueqin Yin, Lianghua Huang, Yu Liu, Xin Zhao, Deli Zhao, and Kaiqi Huang. Efficient-vqgan: Towards high-resolution image generation with efficient vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7368–7377, 2023.
  • Carion et al. (2020) Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pp.  213–229. Springer, 2020.
  • Carreira et al. (2018) Joao Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about kinetics-600. arXiv preprint arXiv:1808.01340, 2018.
  • Chang et al. (2022) Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11315–11325, 2022.
  • Chen et al. (2020) Mark Chen, Alec Radford, Rewon Child, Jeffrey Wu, Heewoo Jun, David Luan, and Ilya Sutskever. Generative pretraining from pixels. In International conference on machine learning, pp.  1691–1703. PMLR, 2020.
  • Devlin (2018) Jacob Devlin. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dhariwal & Nichol (2021) Prafulla Dhariwal and Alexander Nichol. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
  • Dosovitskiy (2020) Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Esser et al. (2021) Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  12873–12883, 2021.
  • Ge et al. (2022) Songwei Ge, Thomas Hayes, Harry Yang, Xi Yin, Guan Pang, David Jacobs, Jia-Bin Huang, and Devi Parikh. Long video generation with time-agnostic vqgan and time-sensitive transformer. In European Conference on Computer Vision, pp.  102–118. Springer, 2022.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Google et al. (2023) Gemini Team Google, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  • Gu et al. (2024) Yuchao Gu, Xintao Wang, Yixiao Ge, Ying Shan, and Mike Zheng Shou. Rethinking the objectives of vector-quantized tokenizers for image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7631–7640, 2024.
  • Hendrycks et al. (2020) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
  • Henighan et al. (2020) Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
  • Ho & Salimans (2022) Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  • Ho et al. (2022) Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
  • Hong et al. (2022) Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers. arXiv preprint arXiv:2205.15868, 2022.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  4401–4410, 2019.
  • Karras et al. (2020) Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of stylegan. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  8110–8119, 2020.
  • Kingma (2013) Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kingma (2014) Diederik P Kingma. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kondratyuk et al. (2023) Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Rachel Hornung, Hartwig Adam, Hassan Akbari, Yair Alon, Vighnesh Birodkar, et al. Videopoet: A large language model for zero-shot video generation. arXiv preprint arXiv:2312.14125, 2023.
  • Lee et al. (2022) Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  11523–11532, 2022.
  • Li et al. (2023) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp.  19730–19742. PMLR, 2023.
  • Loshchilov (2017) I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lu et al. (2022) Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2022.
  • Luo et al. (2023) Zhengxiong Luo, Dayou Chen, Yingya Zhang, Yan Huang, Liang Wang, Yujun Shen, Deli Zhao, Jingren Zhou, and Tieniu Tan. Videofusion: Decomposed diffusion models for high-quality video generation. arXiv preprint arXiv:2303.08320, 2023.
  • Mihaylova & Martins (2019) Tsvetomila Mihaylova and André FT Martins. Scheduled sampling for transformers. arXiv preprint arXiv:1906.07651, 2019.
  • Peebles & Xie (2023) William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4195–4205, 2023.
  • Podell et al. (2023) Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. Sdxl: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Rae et al. (2021) Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446, 2021.
  • Ranzato et al. (2015) Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
  • Razavi et al. (2019) Ali Razavi, Aaron Van den Oord, and Oriol Vinyals. Generating diverse high-fidelity images with vq-vae-2. Advances in neural information processing systems, 32, 2019.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  • Skorokhodov et al. (2022) Ivan Skorokhodov, Sergey Tulyakov, and Mohamed Elhoseiny. Stylegan-v: A continuous video generator with the price, image quality and perks of stylegan2. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  3626–3636, 2022.
  • Skorokhodov et al. (2024) Ivan Skorokhodov, Willi Menapace, Aliaksandr Siarohin, and Sergey Tulyakov. Hierarchical patch diffusion models for high-resolution video generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  7569–7579, 2024.
  • Song et al. (2020) Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  • Soomro (2012) K Soomro. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • Sun et al. (2024) Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525, 2024.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Tseng et al. (2021) Hung-Yu Tseng, Lu Jiang, Ce Liu, Ming-Hsuan Yang, and Weilong Yang. Regularizing generative adversarial networks under limited data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  7921–7931, 2021.
  • Unterthiner et al. (2018) Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges. arXiv preprint arXiv:1812.01717, 2018.
  • Van den Oord et al. (2016) Aaron Van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with pixelcnn decoders. Advances in neural information processing systems, 29, 2016.
  • Van Den Oord et al. (2016) Aäron Van Den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. In International conference on machine learning, pp.  1747–1756. PMLR, 2016.
  • Van Den Oord et al. (2017) Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017.
  • Vaswani (2017) A Vaswani. Attention is all you need. Advances in Neural Information Processing Systems, 2017.
  • Villegas et al. (2022) Ruben Villegas, Mohammad Babaeizadeh, Pieter-Jan Kindermans, Hernan Moraldo, Han Zhang, Mohammad Taghi Saffar, Santiago Castro, Julius Kunze, and Dumitru Erhan. Phenaki: Variable length video generation from open domain textual descriptions. In International Conference on Learning Representations, 2022.
  • Wang et al. (2024) Junke Wang, Yi Jiang, Zehuan Yuan, Binyue Peng, Zuxuan Wu, and Yu-Gang Jiang. Omnitokenizer: A joint image-video tokenizer for visual generation. arXiv preprint arXiv:2406.09399, 2024.
  • Yan et al. (2021) Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021.
  • Yu et al. (2021) Jiahui Yu, Xin Li, Jing Yu Koh, Han Zhang, Ruoming Pang, James Qin, Alexander Ku, Yuanzhong Xu, Jason Baldridge, and Yonghui Wu. Vector-quantized image modeling with improved vqgan. arXiv preprint arXiv:2110.04627, 2021.
  • Yu et al. (2023a) Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, et al. Magvit: Masked generative video transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  10459–10469, 2023a.
  • Yu et al. (2023b) Lijun Yu, José Lezama, Nitesh B Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Agrim Gupta, Xiuye Gu, Alexander G Hauptmann, et al. Language model beats diffusion–tokenizer is key to visual generation. arXiv preprint arXiv:2310.05737, 2023b.
  • Yu et al. (2024) Qihang Yu, Mark Weber, Xueqing Deng, Xiaohui Shen, Daniel Cremers, and Liang-Chieh Chen. An image is worth 32 tokens for reconstruction and generation. arXiv preprint arXiv:2406.07550, 2024.
  • Zhang et al. (2023) Jiahui Zhang, Fangneng Zhan, Christian Theobalt, and Shijian Lu. Regularized vector quantization for tokenized image synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  18467–18476, 2023.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  586–595, 2018.
  • Zheng et al. (2022) Chuanxia Zheng, Tung-Long Vuong, Jianfei Cai, and Dinh Phung. Movq: Modulating quantized vectors for high-fidelity image generation. Advances in Neural Information Processing Systems, 35:23412–23425, 2022.
  • Zheng et al. (2024) Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, and Zongqing Lu. Unicode: Learning a unified codebook for multimodal large language models. arXiv preprint arXiv:2403.09072, 2024.

Appendix A Additional Implementation Details

A.1 Additional Implementation Details of the LARP Tokenizer

During the training of LARP, a GAN loss (Goodfellow et al., 2014) is employed to enhance reconstruction quality. We use a ViT-based discriminator (Dosovitskiy, 2020) with identical patchify settings to those of the LARP tokenizer. The discriminator is updated once for every five updates of the LARP tokenizer and is trained with a learning rate that is 30% of the LARP tokenizer’s learning rate. To stabilize discriminator training, LeCam regularization (Tseng et al., 2021) is applied, following the approach of Yu et al. (2023a). A GAN loss weight of 0.3 is used throughout the training.
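The update schedule of the discriminator can be sketched as follows. The function names are hypothetical, and updating the discriminator on every fifth tokenizer step (rather than at some other offset within each group of five) is an assumption of this sketch.

```python
def discriminator_update_steps(num_tokenizer_updates, every=5):
    """1-based tokenizer update indices after which the discriminator is also updated."""
    return [step for step in range(1, num_tokenizer_updates + 1) if step % every == 0]

def discriminator_lr(tokenizer_lr, ratio=0.3):
    """Discriminator learning rate as a fraction of the tokenizer learning rate."""
    return tokenizer_lr * ratio

steps = discriminator_update_steps(12)   # discriminator updates within 12 tokenizer updates
disc_lr = discriminator_lr(1e-4)         # assuming the base learning rate of 1e-4
```

Updating the discriminator less often and with a smaller learning rate keeps it from overpowering the tokenizer early in training.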

Fixed sin-cos positional encoding (Vaswani, 2017) is used in both the encoder and decoder of LARP. In the encoder, fixed 3D positional encoding is applied to each video patch, while in the decoder, fixed 1D positional encoding is added to each holistic token. Notably, since the patch queries and holistic queries are position-wise learnable parameters, they do not require additional positional encodings.
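
For reference, a fixed 1D sin-cos encoding in the style of Vaswani (2017), as would be added to each holistic token in the decoder, might look like the sketch below. The function name, the interleaved sine/cosine layout, and the base of 10000 are assumptions taken from the standard formulation. The 3D variant used in the encoder is not shown, because this appendix does not state how channels are split across the temporal and spatial axes.

```python
import math


def sincos_position_encoding(num_positions, dim, base=10000.0):
    """Build a fixed (num_positions x dim) sinusoidal table as nested lists.

    Even channels hold sin(pos / base^(2i/dim)); odd channels hold the matching cosine.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_positions):
        row = []
        for i in range(dim // 2):
            angle = pos / (base ** (2 * i / dim))
            row.append(math.sin(angle))
            row.append(math.cos(angle))
        table.append(row)
    return table


# Hypothetical sizes, for illustration only: 4 positions with 8 channels each.
pe = sincos_position_encoding(num_positions=4, dim=8)
```

Because the table depends only on position and channel index, it can be computed once and reused, which is what distinguishes it from the learnable patch and holistic queries.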

In the SVQ module, we set the total quantization loss weight to 0.1. Additionally, we follow Esser et al. (2021) by using a commitment loss weight of 0.25 and a codebook loss weight of 1.0. Both the $L_1$ reconstruction loss and the LPIPS perceptual loss are assigned a weight of 1.0.
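
One way to read the weights listed in this appendix is as a single weighted sum, sketched below. The function and argument names are hypothetical. The nesting (the 0.1 quantization weight scaling the sum of the codebook and commitment terms) is an assumption about how the weights compose, and the AR prior loss is omitted because its weight is not given in this appendix.

```python
def tokenizer_loss(l1, lpips, codebook, commitment, gan,
                   w_l1=1.0, w_lpips=1.0, w_quant=0.1,
                   w_codebook=1.0, w_commit=0.25, w_gan=0.3):
    """Combine scalar loss terms using the weights stated in Appendix A.1.

    Assumption: w_quant multiplies the already-weighted codebook + commitment sum.
    """
    quantization = w_codebook * codebook + w_commit * commitment
    return w_l1 * l1 + w_lpips * lpips + w_quant * quantization + w_gan * gan


# With every raw term equal to 1.0, the total is 1 + 1 + 0.1 * 1.25 + 0.3.
example_total = tokenizer_loss(1.0, 1.0, 1.0, 1.0, 1.0)
```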

In most experiments, we train the LARP tokenizer for 75 epochs on a combined dataset of UCF-101 and K600 with a batch size of 64, totaling approximately 500k training steps. Random horizontal flipping is used as a data augmentation technique. The exception is LARP-L-Long in Table 1, which is trained for 150 epochs with a batch size of 128.

The Adam optimizer (Kingma, 2014) is used with a base learning rate of $1 \times 10^{-4}$, $\beta_1 = 0.9$, and $\beta_2 = 0.95$, following a warm-up cosine learning rate schedule.
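
A warm-up cosine schedule of this kind is commonly implemented as a linear ramp followed by a half-cosine decay, as in the sketch below. The warm-up length, the linear ramp shape, and the final learning rate of zero are assumptions; the text specifies only the base learning rate and the schedule family.

```python
import math


def warmup_cosine_lr(step, total_steps, base_lr=1e-4, warmup_steps=1000, min_lr=0.0):
    """Learning rate at a 0-based step: linear warm-up, then cosine decay to min_lr.

    Assumptions: warmup_steps and min_lr are illustrative values, not from the paper.
    """
    if step < warmup_steps:
        # Linear ramp that reaches base_lr on the last warm-up step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, at the end of warm-up, and at the final step.
schedule_points = [warmup_cosine_lr(s, total_steps=10000) for s in (0, 999, 10000)]
```

The same function covers the AR generative model by passing its base learning rate of 6e-4 as `base_lr`.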

A.2 Additional Implementation Details of the AR Generative Model

We use Llama-like transformers as our AR generative models. Unlike the original implementation and Sun et al. (2024), we utilize absolute learned positional encodings. A token dropout probability of 0.1 is applied during training, with both residual and feedforward dropout probabilities also set to 0.1. Additionally, when training the AR generative models, the SVQ module of the LARP tokenizer is set to be deterministic, ensuring a more accurate latent representation.

Our default AR generative model consists of 632M parameters, as specified in Table 1. It is trained on the training split of the UCF-101 dataset for 1000 epochs with a batch size of 32. The model used in the last row of Table 1, which also has 632M parameters, is trained for 3000 epochs on UCF-101 with a batch size of 64.

The AdamW optimizer (Loshchilov, 2017) is used with $\beta_1 = 0.9$, $\beta_2 = 0.95$, a weight decay of 0.05, and a base learning rate of $6 \times 10^{-4}$, following a warm-up cosine learning rate schedule.

When generating videos, we apply a small Classifier-Free Guidance (CFG) scale of 1.25 (Ho & Salimans, 2022). We do not use top-k or top-p sampling methods.
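
As an illustration of how a CFG scale is typically applied to next-token logits, the sketch below mixes class-conditional and unconditional predictions and then normalizes over the full vocabulary, with no top-k or top-p truncation. The function names are hypothetical, and the mixing rule `uncond + scale * (cond - uncond)` is the common formulation, assumed here rather than confirmed by the text.

```python
import math


def cfg_logits(cond_logits, uncond_logits, scale=1.25):
    """Extrapolate from unconditional toward conditional logits by `scale`.

    A scale of 1.0 returns the conditional logits unchanged.
    """
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over the whole vocabulary (no truncation)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy two-token vocabulary, for illustration only.
guided = cfg_logits([2.0, 0.0], [1.0, 0.0])
probs = softmax(guided)
```

The next token would then be sampled from `probs` directly, since no top-k or top-p filtering is applied.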

Appendix B Additional Visualization Results

B.1 Video Reconstruction Comparison

Additional video reconstruction results are provided in Figure 7. Across a variety of scenes and regions, LARP consistently demonstrates superior reconstruction quality compared to OmniTokenizer (Wang et al., 2024).

B.2 Class-Conditional Video Generation on UCF-101 Dataset

We provide additional class-conditional video generation results in Figure 8. These results further demonstrate LARP’s ability to generate high-quality videos with both strong per-frame fidelity and temporal consistency across various action classes in the UCF-101 dataset. The generated videos show diverse scene dynamics, capturing fine-grained details and natural motion, highlighting LARP’s effectiveness in handling complex generative tasks within this challenging dataset.

Generated video files (in MP4 format) are available in the supplementary materials.

B.3 Video Frame Prediction on K600 Dataset

We present additional video frame prediction results in Figure 9, further demonstrating LARP’s capacity to accurately predict future frames in the K600 dataset. These results showcase LARP’s ability to handle a wide range of dynamic scenes, capturing temporal dependencies with natural motion and smooth transitions between predicted frames. The predictions highlight LARP’s effectiveness in scenarios involving complex motion and scene diversity, underscoring its strong generalization capabilities in video frame prediction tasks.

The predicted frames and the ground truth videos (in MP4 format) are available in the supplementary materials.

Figure 7: Additional video reconstruction comparison with OmniTokenizer (Wang et al., 2024).
Figure 8: Additional class-conditional generation results on UCF-101 dataset.
Figure 9: Additional video frame prediction results on K600 dataset.