CamContextI2V: Context-aware Controllable Video Generation

Luis Denninger1, Sina Mokhtarzadeh Azar1, Juergen Gall1,2
1University of Bonn, 2Lamarr Institute for Machine Learning and Artificial Intelligence
Abstract

Recently, image-to-video (I2V) diffusion models have demonstrated impressive scene understanding and generative quality, incorporating image conditions to guide generation. However, these models primarily animate static images without extending beyond their provided context. Introducing additional constraints, such as camera trajectories, can enhance diversity but often degrades visual quality, limiting their applicability for tasks requiring faithful scene representation. We propose CamContextI2V, an I2V model that integrates multiple image conditions with 3D constraints alongside camera control to enrich both global semantics and fine-grained visual details. This enables more coherent and context-aware video generation. Moreover, we motivate the necessity of temporal awareness for an effective context representation. Our comprehensive study on the RealEstate10K dataset demonstrates improvements in visual quality and camera controllability. We make our code and models publicly available at: https://github.com/LDenninger/CamContextI2V.

1 Introduction

Diffusion models have become a prominent approach for video generation, producing high-quality videos based on user inputs. To make such approaches attractive for digital content creation, controllability achieved through specific conditioning of the generations, such as human poses [26, 18], style [10, 29], motion [22, 19] or camera trajectories [32, 7, 25], has been a widely studied topic.

While text-to-video (T2V) diffusion models like VideoCrafter [3] or CogVideoX [28] have full freedom over the visual design, more recent image-to-video (I2V) models employ an image to convey style and scene context. Due to the typically short duration ($<2$ seconds) of the generated videos, the image provides sufficient context to define the scene to render. With the ultimate objective of matching the generative quality and capabilities of traditional rendering engines, these approaches still require further development to achieve a fine-grained control over style, motion and scene composition, to allow for fully customizable video creation.

Figure 1: CamContextI2V performs context-aware generation provided a reference frame representing the initial frame and [1-4] additional views providing crucial context to the diffusion process missing in the reference frame.

As illustrated in Fig. 1, the initial reference frame alone provides only limited context for the diffusion process. Once the camera pans, the visual quality degrades and arbitrary interpretations of the scene by the diffusion model become evident. To address this, we introduce CamContextI2V, a novel conditioning mechanism that allows users to supply multiple context views, ensuring a comprehensive definition of the scene in which the video is generated.

Our proposed context-aware encoder integrates these context views into two complementary streams: a high-level semantic stream and a 3D-aware visual stream. This dual-stream approach provides the diffusion model with both a global semantic context and a detailed pixel-level visual embedding. By inserting 3D geometric constraints in the feature aggregation, we effectively retrieve important features from the context while filtering out irrelevant ones. This allows our method to considerably enhance the visual coherence of existing approaches. In summary, our key contributions are as follows.

  • We propose CamContextI2V, a camera-controllable context-aware diffusion model, which conditions the diffusion process on multiple context frames through a dual-stream encoder retrieving high-level semantic features and low-level visual cues from the context.

  • We introduce a 3D-aware cross-attention mechanism leveraging epipolar constraints to effectively retrieve context from posed images.

  • Our temporally-aware embedding strategy better aligns the context at different frame timesteps.

  • Our method achieves a 24.09% improvement in visual quality over state-of-the-art methods on the RealEstate10K dataset.

2 Related Works

Diffusion-based Video Generation.

Originally developed for image generation [8, 15], diffusion models have since demonstrated great success in synthesizing high-quality videos [9, 2]. Models such as SVD [1], LAVIE [20] or VideoCrafter [3] have shown great success in distilling text-to-video (T2V) diffusion models from text-to-image (T2I) diffusion models by inserting temporal attention blocks that model the added time dimension. Building on these, models like DynamiCrafter [24], Seine [4] or I2vgen-XL [31] further fine-tune such models for image-to-video (I2V) generation, showing impressive results.

Camera-controllable Video Generation.

Concurrent work also focuses on adding camera control to diffusion models, allowing the user to define the trajectory along which a video is generated. While initial work such as MotionCtrl [22], AnimateDiff [6] or Direct-a-Video [27] models camera movements through camera-motion primitives, recent approaches such as CameraCtrl [7], CamCo [25] or CamI2V [32] directly insert the camera poses, showcasing fine-grained camera control. A key ingredient is the dense supervisory signal, such as pixel-wise camera rays represented through Plücker coordinates, which are encoded and inserted into the diffusion model in a ControlNet-like fashion [30].

CamCo and CamI2V further demonstrate that epipolar geometry can serve as an effective constraint on the information aggregation of vanilla attention mechanisms. While CamCo employs cross-attention to constrain the feature aggregation from the condition frame, CamI2V constrains the temporal self-attention itself to guide the diffusion process, thus improving the 3D consistency and camera trajectory.

Multi-Image Condition.

Large camera movements or longer generations result in multiple scenes being generated in one video, which is insufficiently represented through the single reference image typically employed in concurrent image-to-video diffusion models [24, 4, 31, 28]. Models like Gen-L-Video [17], MEVG [13] or VideoStudio [11] explore the insertion of multiple text prompts to give a broader context across the temporal domain for longer video generation. This is achieved by generating distinct short videos with different text conditions and optimizing the noise between them either in a divide-and-conquer or auto-regressive setup to generate long consistent videos. Other approaches like 4DiM [23] or Seine [4] explore the insertion of multiple image conditions to induce motion cues or as key-frames to interpolate between.

3 Preliminaries

Before we describe in Section 4 our novel method, which enhances the context-awareness of pre-trained diffusion models by conditioning on multiple context views rather than a single reference frame, we briefly describe components of our baseline model, CamI2V, which extends DynamiCrafter [24], a latent image-to-video diffusion model with camera pose conditioning.

Figure 2: CamContextI2V pipeline. Our pipeline generates videos conditioned on a reference image, an optional text description and a camera trajectory encoded through a camera pose encoder. Additionally, context frames are encoded in two parallel streams, one providing pixel-level visual cues and the other a global context. The pixel-level stream employs epipolar attention to enforce 3D-consistent feature aggregation. Finally, both streams are augmented with a timestep embedding to ensure timestep-wise conditioning of the diffusion process.

Latent Video Diffusion Models.

Latent video diffusion models learn a latent video data distribution by gradually reconstructing noisy latents $z_t$ sampled from a Gaussian distribution:

$$q(z_t \mid z_{t-1}) = \mathcal{N}\big(z_t;\ \sqrt{1-\beta_t}\,z_{t-1},\ \beta_t\mathbf{I}\big), \qquad (1)$$

where hyperparameters $\beta_t$ determine the level of noise added at each timestep. The latent space is defined through a pre-trained auto-encoder, e.g. a pre-trained VQGAN [5] for DynamiCrafter, consisting of an encoder $\mathcal{E}$ and a decoder $\mathcal{D}$. Conditioned on a text condition $c_{\text{text}}$ and a reference image $c_{\text{img}}$, the diffusion model $\epsilon_{\theta}$ is then trained to predict the noise $\epsilon$ at timestep $t \sim \mathcal{U}(0, T)$ using a simple reconstruction loss:

$$\min_{\theta}\ \mathbb{E}_{t,\ \mathbf{x} \sim p_{data},\ \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left\| \epsilon - \epsilon_{\theta}(\mathbf{x}_t, \mathbf{c}, t) \right\|_2^2. \qquad (2)$$

The diffusion model itself is typically implemented as a U-Net, e.g. a 3D U-Net [34] in DynamiCrafter, where $\theta$ denotes the neural network’s parameters.
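As a rough illustration of Eqs. 1 and 2, the following NumPy sketch noises a latent and evaluates the noise-prediction loss. It assumes the standard DDPM form in which the Gaussian of Eq. 1 is centered at $\sqrt{1-\beta_t}\,z_{t-1}$, and it uses the mean squared error, which is proportional to the squared L2 norm of Eq. 2. The linear $\beta$ schedule, the function names and the stand-in prediction are hypothetical and not taken from the authors' code.

```python
import numpy as np

def forward_noise_step(z_prev, beta_t, rng):
    """One step of q(z_t | z_{t-1}) with mean sqrt(1 - beta_t) * z_{t-1} and variance beta_t."""
    eps = rng.standard_normal(z_prev.shape)
    return np.sqrt(1.0 - beta_t) * z_prev + np.sqrt(beta_t) * eps

def noise_to_timestep(z0, betas, t, rng):
    """Closed-form sample of z_t from z_0 (t is 1-indexed); also returns the noise used."""
    alpha_bar = np.cumprod(1.0 - betas)[t - 1]
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar) * z0 + np.sqrt(1.0 - alpha_bar) * eps
    return z_t, eps

def epsilon_loss(eps, eps_pred):
    """Mean squared error between the true and the predicted noise."""
    return float(np.mean((eps - eps_pred) ** 2))

# Hypothetical setup: 16 latent frames of size 4 x 32 x 32 and a linear schedule.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 2e-2, 1000)
z0 = rng.standard_normal((16, 4, 32, 32))
z_t, eps = noise_to_timestep(z0, betas, t=500, rng=rng)
eps_pred = np.zeros_like(eps)  # stand-in for the output of a trained network
loss = epsilon_loss(eps, eps_pred)
```

In a real training loop, `eps_pred` would come from the conditioned U-Net rather than a constant array.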

Camera Conditioning.

To incorporate camera control, CamI2V employs a dense supervisory signal using pixel-wise embeddings of camera rays, represented via Plücker coordinates. Specifically, for each pixel $(u, v)$ the Plücker coordinates $P = (o \times d', d')$ are computed using the normalized ray direction $d' = \frac{d}{\|d\|}$ and the ray origin $o$ (the camera focal point).

The ray direction relative to a reference coordinate frame, such as the camera coordinate system of the initial frame, is derived from the intrinsics $\mathbf{K}$ and extrinsics $E = [\mathbf{R} \mid \mathbf{t}]$ as:

$$d = \mathbf{R}\mathbf{K}^{-1}[u, v, 1]^{\intercal} + \mathbf{t}. \qquad (3)$$

These embeddings are further encoded at multiple resolutions and integrated into the epipolar attention blocks inserted into the U-Net.
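The per-pixel Plücker embedding described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions: Eq. 3 is applied to the homogeneous pixel coordinate $[u, v, 1]$, the extrinsics are treated as camera-to-reference so that the ray origin $o$ equals $\mathbf{t}$, and pixel centers are offset by 0.5. The function name and the example intrinsics are hypothetical.

```python
import numpy as np

def plucker_embedding(K, R, t, height, width):
    """Per-pixel Plücker coordinates (o x d', d') with shape (height, width, 6)."""
    # Homogeneous pixel centers [u, v, 1] (the 0.5 offset is an assumption).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)           # (H, W, 3)
    # Eq. 3 per pixel, written for row vectors: d = R K^{-1} [u, v, 1]^T + t.
    d = pix @ np.linalg.inv(K).T @ R.T + t                      # (H, W, 3)
    d_unit = d / np.linalg.norm(d, axis=-1, keepdims=True)      # normalized direction d'
    # Assumption: the ray origin o is the camera center, taken to be t.
    o = np.broadcast_to(t, d_unit.shape)
    moment = np.cross(o, d_unit)                                # o x d'
    return np.concatenate([moment, d_unit], axis=-1)

# Hypothetical camera: focal length 100 px, principal point at the image center.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([1.0, 2.0, 3.0])
P = plucker_embedding(K, R, t, height=64, width=64)
```

By construction the moment $o \times d'$ is orthogonal to $d'$, which is a quick sanity check for any implementation of this embedding.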

4 Method

Image-to-video diffusion models generate videos based on a single reference frame $c_{img}$ and an optional text condition $c_{txt}$. Additionally, camera-controlled diffusion models are conditioned on a camera trajectory $[P_{cam}^{0}, \dots, P_{cam}^{16}]$, allowing precise control of the camera view at each timestep. This enforces generations beyond the context conveyed through the reference frame, which degrades the visual quality. To counteract this, we provide the model with additional context frames $c_{ctx}^{0}, \dots, c_{ctx}^{N}$ and their poses $P_{ctx}^{0}, \dots, P_{ctx}^{N}$ that span a rich context utilized in the diffusion process.

Our Context-aware encoder, shown in Fig. 2, extends DynamiCrafter’s Dual-stream Image Injection to support multiple image conditions. Natively, it conditions the model at the pixel level by concatenating reference latents $z_{img}$ with noisy latents $z_t$ along the channel dimension, which restricts the generations to the narrow context provided by the reference image. Additionally, to better guide the diffusion process, semantic features aggregated from CLIP-embedded image and text conditions are integrated layer-wise through spatial cross-attention. To utilize the pre-trained generative capabilities of the diffusion model and refrain from fine-tuning large parts of the U-Net, we chose to inject our condition in those streams.

Semantic Stream.

We adopt DynamiCrafter’s query transformer $\mathcal{E}_{sem}$ to integrate cross-modal information from the CLIP-embedded reference image $\mathbf{F}_{img}$, the text condition $\mathbf{F}_{txt}$, and additional context frames $\mathbf{F}_{ctx} = [F_{ctx}^{1}, \dots, F_{ctx}^{N}]$. Specifically, $\mathcal{E}_{sem}$ employs learnable latent query tokens $\mathbf{T}_{sem}$ to gather context across multiple layers of cross-attention and feed-forward networks, yielding a global representation:

$$\mathbf{F}_{sem} = \mathcal{E}_{sem}([\mathbf{F}_{img}, \mathbf{F}_{txt}, \mathbf{F}_{ctx}], \mathbf{T}_{sem}). \qquad (4)$$

To preserve strong cross-modal context aggregation, we initialize $\mathcal{E}_{sem}$ from DynamiCrafter’s Dual-stream Image Injection module and fine-tune it to handle multiple image conditions.

Visual Stream.

While the semantic stream provides a well-suited global context representation, it lacks fine-grained visual details due to CLIP’s inherent training on visual-language alignment, which favors high-level representations of single entities.

To enhance context-aware generation, we integrate our visual condition directly into DynamiCrafter’s image conditioning. Specifically, we embed the context frames $c_{ctx}^{0}, \dots, c_{ctx}^{N}$ into the latent space $\mathbf{Z}_{ctx} = [z_{ctx}^{0}, \dots, z_{ctx}^{N}]$ and introduce pixel-wise learnable context tokens $\mathbf{T}_{vis} \in \mathbb{R}^{T \times h \times w \times D}$. The context tokens serve as queries in a query transformer, similar to the semantic stream, to aggregate timestep- and pixel-wise features from the latent context frames.

3D Awareness.

Figure 3: Epipolar cross-attention. Learnable context tokens act as queries to retrieve pixel-level features for each timestep from context views, masked according to epipolar lines to incorporate 3D geometric constraints.

To introduce 3D awareness, we employ an epipolar cross-attention mechanism which guides the feature aggregation to only consider potentially relevant features. Specifically, each token $t_i \in \mathbf{T}_{vis}$, illustrated in Fig. 3, describes a pixel $(u, v)$ at timestep $t$. Employing the provided camera pose $P_{cam}^{t}$ at the given timestep, we can compute the epipolar line $l_{ij} = Ax + By + C$ in each context view $c_{ctx}^{j}$. Using the point-to-line distance:

$$d(u', v') = \frac{[A, B, C]^{\intercal} \cdot [u', v', 1]}{\sqrt{A^2 + B^2}}, \qquad (5)$$

we produce the epipolar mask $m \in \mathbb{R}^{Thw \times Nhw}$ masking out pixels $(u', v')$ with a distance larger than a threshold $\delta$, set to half of the diagonal of the latent feature space, in the cross-attention mechanism:

$$\text{EpiCrossAttn}(q, k, v, m) = \text{softmax}\left(\frac{q k^{\intercal}}{\sqrt{d}} \odot m\right) v, \qquad (6)$$

where $q \in \mathbb{R}^{Thw \times D}$ describes the learnable context queries and $k, v \in \mathbb{R}^{Nhw \times D}$ the latent embedded context frames.
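The epipolar masking and the masked cross-attention of Eqs. 5 and 6 can be sketched in NumPy as below. This is a minimal sketch, not the authors' implementation. It assumes the epipolar line of a query pixel is obtained from a fundamental matrix `F` as $[A, B, C]^{\intercal} = F\,[u, v, 1]^{\intercal}$, takes the absolute value of the distance in Eq. 5, and realizes the mask by setting excluded logits to a large negative value before the softmax, which is a common substitute for the element-wise product written in Eq. 6. All function and variable names are hypothetical.

```python
import numpy as np

def epipolar_mask(F, query_px, key_px, delta):
    """Boolean mask (num_queries, num_keys): True where a key pixel lies
    within `delta` of the epipolar line induced by the query pixel."""
    q_h = np.hstack([query_px, np.ones((len(query_px), 1))])   # [u, v, 1]
    k_h = np.hstack([key_px, np.ones((len(key_px), 1))])       # [u', v', 1]
    lines = q_h @ F.T                                          # rows are [A, B, C]
    numer = np.abs(lines @ k_h.T)                              # |A u' + B v' + C|
    denom = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    return numer / np.maximum(denom, 1e-12) <= delta

def epi_cross_attention(q, k, v, mask):
    """Scaled dot-product cross-attention restricted to the masked keys."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    logits = np.where(mask, logits, -1e9)        # excluded keys get ~zero weight
    logits -= logits.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy example: a pure sideways camera shift gives horizontal epipolar lines (y = v).
F = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]])
query_px = np.array([[3.0, 5.0]])
key_px = np.array([[0.0, 5.0], [10.0, 5.0], [0.0, 9.0]])
mask = epipolar_mask(F, query_px, key_px, delta=1.0)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 8))
k = rng.standard_normal((3, 8))
v = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
out = epi_cross_attention(q, k, v, mask)
```

In the toy example the third key lies four pixels off the epipolar line, so it is masked out and the output is a mixture of the first two value vectors only.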

Temporal Awareness.

The native pixel-level embedding of DynamiCrafter is agnostic to the timestep within the video as each timestep is provided with the same condition. Thus, to further enforce the diffusion model to attend to context provided at specific timesteps, we found it advantageous to employ a sinusoidal timestep embedding. In practice, we concatenate the timestep embedding to our context embeddings before forwarding it through a feed-forward network.
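A minimal NumPy sketch of this step is given below, assuming a Transformer-style sinusoidal embedding with base 10000 and an even embedding dimension. The exact frequencies, the embedding size and the function names are assumptions, and the feed-forward network applied afterwards is omitted.

```python
import numpy as np

def sinusoidal_embedding(timesteps, dim):
    """Sinusoidal embedding of shape (len(timesteps), dim); `dim` must be even."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = np.asarray(timesteps, dtype=float)[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def add_timestep_embedding(context, dim):
    """Concatenate a per-frame embedding to context features of shape (T, h, w, D)."""
    T, h, w, _ = context.shape
    emb = sinusoidal_embedding(np.arange(T), dim)                  # (T, dim)
    emb = np.broadcast_to(emb[:, None, None, :], (T, h, w, dim))   # same value at every pixel
    return np.concatenate([context, emb], axis=-1)                 # (T, h, w, D + dim)

# Hypothetical sizes: 16 frames, a 4 x 4 latent grid and 8 feature channels.
context = np.zeros((16, 4, 4, 8))
conditioned = add_timestep_embedding(context, dim=6)
```

Because every frame index receives a distinct embedding, the subsequent layers can tell the 16 timesteps apart even when the context features themselves are identical.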

Finally, the visual stream of our context-aware encoder maps a spatially distributed embedding represented through the latent embedding of posed views to a timestep-wise embedding:

$$\mathbf{F}_{vis} = \mathcal{E}_{vis}(\mathbf{Z}_{ctx}, \mathbf{T}_{vis}, m). \qquad (7)$$

To retain the reference image as a strong anchor to the generation and smoothly insert the new condition, we employ a 3D zero-convolution which weighs the usage of DynamiCrafter’s native condition $z_{ref}$ and ours $\mathbf{F}_{vis}$ before adding them together.

5 Experiments

Method       FVD ↓ (VideoGPT)   FVD ↓ (StyleGAN)   MSE ↓     TransErr ↓   RotErr ↓   CamMC ↓
MotionCtrl   78.30              64.47              3654.54   2.89         2.04       4.34
CameraCtrl   71.22              58.05              3130.63   2.54         1.84       3.85
CamI2V       71.01              57.90              2692.84   1.79         1.16       2.58
Ours         53.90              45.36              2579.96   1.53         1.09       2.29
Table 1: Quantitative comparison. We compare our method against state-of-the-art camera-controlled diffusion models. The context-awareness of our model improves the visual quality by 24.09% in terms of FVD and MSE over the baseline methods. Moreover, the model follows the camera trajectory more precisely, achieving improvements in RotErr, TransErr and CamMC. The results were obtained using 25 DDIM steps with CFG set to 7.5, except for our method, which performs best with CFG set to 3.5.
[Plots: per-frame MSE (in units of $10^3$) and SSIM over timestep $t$ (2 to 16) for MotionCtrl, CameraCtrl, CamI2V and Ours.]
Figure 4: Frame-wise quantitative comparison. We compare the quality of each frame depending on the timestep in the video against state-of-the-art methods in terms of MSE and SSIM. While the reference frame provides sufficient visual cues for the initial frames, the visual quality is degrading logarithmically as time progresses and the diffusion model is forced to generate scenes beyond the provided context.
[Image grid: reference and context frames, followed by generated frames for GT, CamI2V and Ours.]
Figure 5: Qualitative comparison. The reference frame alone provides insufficient context, leading to degraded visual quality and 3D consistency in the baseline. Our method, provided with an additional context frame, is able to faithfully represent the provided scene, improving visual quality beyond the reference frame. Zoom in for more details.

In the following, we thoroughly analyze our method in the setup described in Sec. 5.1. Sec. 5.2 and Sec. 5.3 compare our method to baseline camera-controlled methods and argue for the effectiveness of the provided context. Finally, in Sec. 5.4, we investigate our design choices and motivate the importance of 3D and temporal awareness in our method and the complementary effect of our two-stream design. Additionally, we demonstrate the impact of different sampling strategies on our method.

5.1 Setup

Dataset

The RealEstate10K dataset [33] comprises approximately 70K video clips at 720p of static scenes depicting indoor and outdoor house tours. The clips are annotated with camera extrinsic and intrinsic values obtained through the ORB-SLAM2 [12] pipeline. Additionally, we use the captions provided by the authors of CameraCtrl [7]. The video clips are then center-cropped to a size of $256 \times 256$ and cut into short clips of 16 frames with a stride sampled between 1 and 10.

Metrics

We evaluate our method with respect to generative quality, the faithfulness to the provided context and the camera trajectory. Firstly, to assess visual quality we report the Fréchet Video Distance (FVD) [16]. To assess the faithfulness with respect to the additional context, we evaluate the pixel-wise mean squared error (MSE) and the Structural Similarity Index (SSIM) [21] independently for each timestep.

Finally, to examine the generated camera trajectory we follow the evaluation paradigm proposed by CameraCtrl and CamI2V. Using the structure-from-motion pipeline GLOMAP [14], we estimate the camera rotation $\tilde{R}_i$ and translation $\tilde{T}_i$ for each camera $i$ and compute the independent rotation and translation errors, RotErr and TransErr respectively, as well as the combined element-wise error CamMC:

$$\text{RotErr} = \sum_{i=1}^{n} \cos^{-1} \frac{\text{tr}(\tilde{R}_i R_i^{T}) - 1}{2}, \qquad (8)$$
$$\text{TransErr} = \sum_{i=1}^{n} \left\| \tilde{T}_i - T_i \right\|_2, \qquad (9)$$
$$\text{CamMC} = \sum_{i=1}^{n} \left\| [\tilde{R}_i \mid \tilde{T}_i] - [R_i \mid T_i] \right\|_2. \qquad (10)$$

To counteract the randomness of GLOMAP, we follow the scheme of averaging the metrics over the successful runs out of 5 trials. All metrics are computed on a subset consisting of videos longer than 30 seconds to ensure that sufficient additional context can be sampled and to avoid sampling too close to the 16-frame clip.
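The three trajectory metrics of Eqs. 8 to 10 can be computed along the following lines. This is a sketch under stated assumptions: the estimated and ground-truth poses are already expressed in a common coordinate frame and scale, the matrix norm in Eq. 10 is taken to be the Frobenius norm of the $3 \times 4$ pose difference, and the function name is hypothetical.

```python
import numpy as np

def camera_errors(R_est, T_est, R_gt, T_gt):
    """Return (RotErr, TransErr, CamMC) summed over n cameras.

    R_est, R_gt: rotations of shape (n, 3, 3); T_est, T_gt: translations of shape (n, 3).
    """
    # tr(R_est_i @ R_gt_i^T) equals the sum of element-wise products of the two matrices.
    traces = np.einsum("nij,nij->n", R_est, R_gt)
    cos_angle = np.clip((traces - 1.0) / 2.0, -1.0, 1.0)   # guard against rounding
    rot_err = float(np.sum(np.arccos(cos_angle)))           # Eq. 8, in radians
    trans_err = float(np.sum(np.linalg.norm(T_est - T_gt, axis=-1)))  # Eq. 9
    pose_est = np.concatenate([R_est, T_est[..., None]], axis=-1)     # (n, 3, 4)
    pose_gt = np.concatenate([R_gt, T_gt[..., None]], axis=-1)
    # Eq. 10, assuming the Frobenius norm over each 3 x 4 difference.
    cam_mc = float(np.sum(np.linalg.norm(pose_est - pose_gt, axis=(-2, -1))))
    return rot_err, trans_err, cam_mc

# Toy example: camera 0 is shifted by (3, 4, 0), camera 1 is rotated by 90 degrees about z.
R_gt = np.stack([np.eye(3), np.eye(3)])
T_gt = np.zeros((2, 3))
R_est = R_gt.copy()
R_est[1] = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
T_est = np.array([[3.0, 4.0, 0.0], [0.0, 0.0, 0.0]])
rot_err, trans_err, cam_mc = camera_errors(R_est, T_est, R_gt, T_gt)
```

In practice these values would be averaged over the successful GLOMAP runs, as described above.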

Implementation Details

Our model builds on DynamiCrafter with inserted camera control from CamI2V. Training is initialized from CamI2V checkpoints at 50K iterations, freezing all parameters except for our Context-aware Encoder. We train for 50K iterations at a resolution of $256 \times 256$, using the Adam optimizer with a fixed learning rate of $1 \times 10^{-4}$ and a batch size of 64. Following the baseline, we chose Lightning as our training framework with mixed precision and DeepSpeed ZeRO-1 on 4 NVIDIA A100 GPUs, training for approximately 7 days. For comparison to the baseline methods we use the re-implementations of MotionCtrl [22] and CameraCtrl [7] provided by the authors of CamI2V. We sample 1-4 context frames uniformly from the complete videos during training. For fairness, during evaluation we only sample from outside the 16-frame window to be generated.
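The clip and context sampling described in this section might look roughly like the sketch below. It is a hypothetical helper, not the authors' data loader: it draws a 16-frame clip with a stride between 1 and 10, then draws 1 to 4 context frames either from the whole video (training) or only from frames before the first or after the last clip frame (evaluation). Interpreting "outside the window" as outside the full span covered by the strided clip is an assumption.

```python
import random

def sample_clip_and_context(num_frames, clip_len=16, max_stride=10,
                            max_context=4, exclude_clip=True, rng=None):
    """Return (clip_indices, context_indices) for a video with `num_frames` frames."""
    if num_frames < clip_len:
        raise ValueError("video is shorter than the requested clip length")
    rng = rng or random.Random()
    # Largest stride that still fits the clip into the video, capped at max_stride.
    stride = rng.randint(1, min(max_stride, (num_frames - 1) // (clip_len - 1)))
    span = (clip_len - 1) * stride + 1
    start = rng.randint(0, num_frames - span)
    clip = [start + i * stride for i in range(clip_len)]
    if exclude_clip:
        # Evaluation: only frames strictly before or after the generated window.
        pool = [f for f in range(num_frames) if f < clip[0] or f > clip[-1]]
    else:
        # Training: any frame of the complete video.
        pool = list(range(num_frames))
    n_ctx = min(rng.randint(1, max_context), len(pool))
    context = sorted(rng.sample(pool, n_ctx))
    return clip, context

# Hypothetical 900-frame video, evaluated with context outside the clip window.
clip, context = sample_clip_and_context(900, exclude_clip=True, rng=random.Random(0))
```

Passing `exclude_clip=False` mimics the training setting, where context frames may coincide with frames of the clip itself.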

[Image grid: reference frame with the Furthest and End+1 context frames, followed by generated frames for GT, Furthest, End+1 and Combined.]
Figure 6: Qualitative results of different sampling strategies. We generate samples conditioned on the furthest frame providing minimal context and the frame immediately following the video providing maximal context. Our method is able to reject unrelated features from the furthest frame and only aggregate features from the end + 1 frame providing additional information to the diffusion process.

5.2 Quantitative Comparison

To show the effectiveness of the additional context provided by our method, we compare against several camera-controlled methods, namely MotionCtrl [22], CameraCtrl [7] and CamI2V [32]. Tab. 1 presents the comparison of our method against the baseline methods. Our model achieves an improvement of 24.09% in terms of the FVD score, highlighting the effectiveness of added context for video generation.

To further evaluate the context-awareness of our method, we report the MSE in Fig. 4 between the generated videos and the ground-truth videos on a per-frame basis to assess the improvement especially for later frames that typically lack sufficient context from the reference frame. Additionally, to assess the visual quality of each frame, we compute the SSIM metric on each frame.
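The per-frame evaluation can be sketched as follows. This is a minimal example that assumes both videos are arrays of shape (frames, height, width, channels) with identical value ranges; the function name `per_frame_mse` is hypothetical. A per-frame SSIM could be obtained analogously by calling an existing implementation, such as `skimage.metrics.structural_similarity`, on each frame pair.

```python
import numpy as np


def per_frame_mse(generated, ground_truth):
    """Return one MSE value per frame for two videos of shape (T, H, W, C)."""
    gen = np.asarray(generated, dtype=np.float64)
    gt = np.asarray(ground_truth, dtype=np.float64)
    if gen.shape != gt.shape:
        raise ValueError("videos must have identical shapes")
    # Flatten everything except the frame axis, then average per frame.
    return ((gen - gt) ** 2).reshape(gen.shape[0], -1).mean(axis=1)
```

Plotting the returned vector against the frame index gives a curve in which later frames can be compared directly with earlier ones.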

It is visible that the visual quality degrades logarithmically with the video length as the diffusion model lacks sufficient context. Our method outperforms the baseline methods in both MSE and SSIM, especially for later frames. This shows that providing the diffusion process with additional context can stabilize the generative quality over time.

Additionally, we investigate the accuracy of the generated camera trajectory with respect to RotErr, TransErr and CamMC. We observe a slightly improved rotational error compared to CamI2V, indicating an improved camera trajectory for our method. The evaluation pipeline, GLOMAP, matches keypoint features to simultaneously estimate the camera trajectory and reconstruct a 3D scene using bundle adjustment. Since we train neither the camera encoder nor the diffusion model itself, the improved camera trajectory is mainly linked to an improved 3D consistency and visual quality of the generated scene. This demonstrates that the additionally provided context enforces a more faithful representation of the 3D scene.

5.3 Qualitative Comparison

Fig. 5 shows different samples from our method compared against CamI2V. It is evident that the reference frame does not provide sufficient context for the generation past the first few frames. This results in visually degrading image quality and unrealistic generations of the baseline method.

In contrast, our method is provided with an additional context frame, sampled from a later timestep past the 16-frame window, that shows entities outside the field of view of the reference frame or occluded by obstacles. Our method is able to comprehend the position of these entities in space and effectively embed it into the timestep-wise embedding, so that these objects are placed at the correct locations in later frames. Moreover, while the baseline method produces artifacts that are not visible in its condition, the extended context provides an additional constraint that prevents such artifacts.

5.4 Ablation Studies

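| Multi-Cond. (Pixel) | Multi-Cond. (Sem.) | Epipolar | Time | FVD ↓ (VideoGPT) | FVD ↓ (StyleGAN) | MSE ↓ (Total) | MSE ↓ (t=2) | MSE ↓ (t=16) |
|---|---|---|---|---|---|---|---|---|
| | | | | 76.00 | 63.40 | 2622.32 | 632.94 | 4141.67 |
| | | | | 70.44 | 59.56 | 2810.75 | 862.84 | 4225.31 |
| | | | | 61.61 | 52.04 | 2678.45 | 782.86 | 4102.77 |
| | | | | 58.15 | 47.73 | 2642.69 | 753.36 | 4014.67 |
| | | | | 53.90 | 45.36 | 2579.96 | 668.60 | 4076.78 |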
Table 2: Ablation studies. We compare our design choices in different studies showing that our two-stream design complementarily embeds the context and guides the diffusion process. Moreover, the 3D and temporal awareness induced into the embedding by our method is beneficial for an effective context-aware conditioning.

To thoroughly evaluate the impact of our design choices, we conducted several ablation studies. The results are summarized in Tab. 2.

Semantic and visual stream.

First, we examined the individual contributions of the semantic and visual streams to the diffusion process. We trained two model variants, each utilizing only one stream to inject additional context. Despite both variants being provided with an extended context, neither improved upon the baseline results. This limited improvement likely stems from DynamiCrafter being originally trained under matching conditions. In contrast, combining both semantic and visual streams significantly enhanced performance, highlighting their complementary interaction.

3D awareness.

Next, we evaluated the effectiveness of our method’s 3D awareness, achieved through the epipolar cross-attention mechanism. Replacing epipolar cross-attention with standard (vanilla) cross-attention, which allows unrestricted feature aggregation from all tokens, still yields a considerable improvement of 9.5 FVD points over the baseline but fails to match the performance of the 3D-aware model variant. This can be attributed to the model still leveraging the additional context for the generation but failing to reject features from invalid positions, as seen in Fig. 6, especially when context frames provide minimal additional information because they are sampled from distant regions.
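To make the difference between the two attention variants concrete, the sketch below builds a boolean epipolar mask for a single pair of views. It is an illustration under stated assumptions, not the paper's implementation: the fundamental matrix `F` is assumed to map a pixel of the generated frame to its epipolar line in the context frame, the distance `threshold` (in pixels) is a hypothetical parameter, and the names `pixel_grid` and `epipolar_mask` are made up for this example. Vanilla cross-attention corresponds to a mask that is `True` everywhere.

```python
import numpy as np


def pixel_grid(h, w):
    """Homogeneous pixel coordinates (x, y, 1), shape (h*w, 3), row-major."""
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)], axis=-1).astype(np.float64)


def epipolar_mask(F, h, w, threshold=1.0):
    """Boolean mask of shape (h*w, h*w).

    Entry [q, k] is True if context pixel k lies within `threshold` pixels
    of the epipolar line l_q = F x_q induced by query pixel q.
    """
    pts = pixel_grid(h, w)
    lines = pts @ F.T                                   # (Q, 3), one line per query
    num = np.abs(lines @ pts.T)                         # (Q, K), |l_q . x_k|
    den = np.linalg.norm(lines[:, :2], axis=1, keepdims=True)
    # A degenerate all-zero line yields distance 0, so that query stays unmasked.
    dist = num / np.maximum(den, 1e-12)
    return dist <= threshold
```

In a masked cross-attention layer, attention logits at positions where the mask is `False` would be suppressed before the softmax, so that features far from the epipolar line cannot be aggregated.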

Temporal awareness.

Further, we assess the effect of temporal embeddings integrated into the semantic and visual streams. Removing the temporal embeddings results in a performance decline, although the model still outperforms CamI2V considerably. The temporal embeddings, particularly within the visual stream, explicitly guide the temporal attention of the U-Net to properly interpret timestep-specific context. Without this guidance, the context embedded timestep-wise by the epipolar cross-attention may be interpreted freely, resulting in impaired performance.

| Sample Range | FVD (VideoGPT) | MSE |
|---|---|---|
| (end, -1] | 45.63 | 2579.96 |
| end+1 | 44.21 | 2474.28 |
| Furthest | 48.52 | 2668.91 |
Table 3: Condition Sampling Study. To investigate the impact of different context views, we condition our method using different context sampling strategies. (end,-1] represents the sampling strategy used through our evaluations, while end+1 provides context with the maximal amount of information and furthest with the minimal amount of information.

Context sampling.

Lastly, Tab. 3 compares different sampling strategies for additional context views. Our default method samples context frames from the interval (end, -1] following the generated video. Further, we investigate two extremes. First, we sample a completely unrelated frame, the furthest frame, as can be seen in Fig. 6. Our results show that this only slightly degrades the visual quality, indicating that our method effectively rejects unrelated features through the 3D awareness induced by the epipolar cross-attention. Second, we sample the frame directly following the video, which provides a maximal amount of information to the diffusion process. This only slightly improves our method, showing that it can effectively gather context from loosely placed context views. The qualitative results in Fig. 6 show that our context-aware encoder effectively filters out unrelated information and provides the diffusion process only with the necessary context.
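The three sampling strategies can be sketched with frame indices as follows. This is a minimal example under assumptions: "end" is taken to be the index of the last generated frame, "furthest" is interpreted as the frame with the largest temporal distance to the 16-frame clip, and the function name `sample_context` and its strategy labels are hypothetical.

```python
import random

CLIP_LEN = 16  # number of generated frames


def sample_context(num_frames, clip_start, strategy, k=1, rng=None):
    """Return sorted context frame indices for a clip starting at clip_start."""
    rng = rng or random.Random()
    end = clip_start + CLIP_LEN - 1               # last generated frame
    after = list(range(end + 1, num_frames))      # the interval (end, -1]
    if strategy == "end+1":
        if not after:
            raise ValueError("no frame follows the clip")
        return [end + 1]
    if strategy == "furthest":
        outside = [i for i in range(num_frames) if i < clip_start or i > end]
        if not outside:
            raise ValueError("no frame outside the clip")
        # Distance of an outside frame to the nearest clip boundary.
        return [max(outside, key=lambda i: min(abs(i - clip_start), abs(i - end)))]
    if strategy == "range":
        if len(after) < k:
            raise ValueError("not enough frames after the clip")
        return sorted(rng.sample(after, k))
    raise ValueError(f"unknown strategy: {strategy}")
```

For example, `sample_context(100, 10, "range", k=3)` draws three distinct frames from indices 26 to 99, while `"end+1"` always returns frame 26 for this clip.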

6 Conclusion

This paper introduces CamContextI2V, a novel conditioning mechanism that provides the diffusion process with extensive contextual information derived from multiple context views. Unlike conventional image-to-video diffusion models, which typically rely on a single reference image, our proposed method employs a context-aware encoder that encodes additional context through a high-level semantic stream and a 3D-aware visual stream, generating a global semantic representation and a dense, pixel-wise visual embedding from context views. This encoder effectively aggregates relevant features complementing the generation while filtering out unrelated information. As a result, our approach significantly improves visual quality and enhances the accuracy of the generated camera trajectories compared to existing baseline methods.

References

  • Blattmann et al. [2023a] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023a.
  • Blattmann et al. [2023b] Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. Stable video diffusion: Scaling latent video diffusion models to large datasets, 2023b.
  • Chen et al. [2023a] Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation, 2023a.
  • Chen et al. [2023b] Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. Seine: Short-to-long video diffusion model for generative transition and prediction, 2023b.
  • Esser et al. [2021] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2021.
  • Guo et al. [2024] Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning, 2024.
  • He et al. [2024] Hao He, Yinghao Xu, Yuwei Guo, Gordon Wetzstein, Bo Dai, Hongsheng Li, and Ceyuan Yang. Cameractrl: Enabling camera control for text-to-video generation, 2024.
  • Ho et al. [2020] Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models, 2020.
  • Ho et al. [2022] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models, 2022.
  • Liu et al. [2024] Gongye Liu, Menghan Xia, Yong Zhang, Haoxin Chen, Jinbo Xing, Yibo Wang, Xintao Wang, Yujiu Yang, and Ying Shan. Stylecrafter: Enhancing stylized text-to-video generation with style adapter, 2024.
  • Long et al. [2024] Fuchen Long, Zhaofan Qiu, Ting Yao, and Tao Mei. Videostudio: Generating consistent-content and multi-scene videos, 2024.
  • Mur-Artal and Tardos [2017] Raul Mur-Artal and Juan D. Tardos. Orb-slam2: An open-source slam system for monocular, stereo, and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
  • Oh et al. [2024] Gyeongrok Oh, Jaehwan Jeong, Sieun Kim, Wonmin Byeon, Jinkyu Kim, Sungwoong Kim, and Sangpil Kim. Mevg: Multi-event video generation with text-to-video models, 2024.
  • Pan et al. [2024] Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L. Schönberger. Global structure-from-motion revisited, 2024.
  • Rombach et al. [2022] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2022.
  • Unterthiner et al. [2019] Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges, 2019.
  • Wang et al. [2023a] Fu-Yun Wang, Wenshuo Chen, Guanglu Song, Han-Jia Ye, Yu Liu, and Hongsheng Li. Gen-l-video: Multi-text to long video generation via temporal co-denoising, 2023a.
  • Wang et al. [2024a] Tan Wang, Linjie Li, Kevin Lin, Yuanhao Zhai, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, and Lijuan Wang. Disco: Disentangled control for realistic human dance generation, 2024a.
  • Wang et al. [2023b] Xiang Wang, Hangjie Yuan, Shiwei Zhang, Dayou Chen, Jiuniu Wang, Yingya Zhang, Yujun Shen, Deli Zhao, and Jingren Zhou. Videocomposer: Compositional video synthesis with motion controllability, 2023b.
  • Wang et al. [2023c] Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, Yuwei Guo, Tianxing Wu, Chenyang Si, Yuming Jiang, Cunjian Chen, Chen Change Loy, Bo Dai, Dahua Lin, Yu Qiao, and Ziwei Liu. Lavie: High-quality video generation with cascaded latent diffusion models, 2023c.
  • Wang et al. [2004] Zhou Wang, A.C. Bovik, H.R. Sheikh, and E.P. Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612, 2004.
  • Wang et al. [2024b] Zhouxia Wang, Ziyang Yuan, Xintao Wang, Tianshui Chen, Menghan Xia, Ping Luo, and Ying Shan. Motionctrl: A unified and flexible motion controller for video generation, 2024b.
  • Watson et al. [2024] Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, and David J. Fleet. Controlling space and time with diffusion models, 2024.
  • Xing et al. [2023] Jinbo Xing, Menghan Xia, Yong Zhang, Haoxin Chen, Wangbo Yu, Hanyuan Liu, Xintao Wang, Tien-Tsin Wong, and Ying Shan. Dynamicrafter: Animating open-domain images with video diffusion priors, 2023.
  • Xu et al. [2024] Dejia Xu, Weili Nie, Chao Liu, Sifei Liu, Jan Kautz, Zhangyang Wang, and Arash Vahdat. Camco: Camera-controllable 3d-consistent image-to-video generation, 2024.
  • Xu et al. [2023] Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, and Mike Zheng Shou. Magicanimate: Temporally consistent human image animation using diffusion model, 2023.
  • Yang et al. [2024a] Shiyuan Yang, Liang Hou, Haibin Huang, Chongyang Ma, Pengfei Wan, Di Zhang, Xiaodong Chen, and Jing Liao. Direct-a-video: Customized video generation with user-directed camera movement and object motion. In Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers ’24, page 1–12. ACM, 2024a.
  • Yang et al. [2024b] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, and Jie Tang. Cogvideox: Text-to-video diffusion models with an expert transformer, 2024b.
  • Ye et al. [2024] Zixuan Ye, Huijuan Huang, Xintao Wang, Pengfei Wan, Di Zhang, and Wenhan Luo. Stylemaster: Stylize your video with artistic generation and translation, 2024.
  • Zhang et al. [2023a] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding conditional control to text-to-image diffusion models, 2023a.
  • Zhang et al. [2023b] Shiwei Zhang, Jiayu Wang, Yingya Zhang, Kang Zhao, Hangjie Yuan, Zhiwu Qin, Xiang Wang, Deli Zhao, and Jingren Zhou. I2vgen-xl: High-quality image-to-video synthesis via cascaded diffusion models, 2023b.
  • Zheng et al. [2024] Guangcong Zheng, Teng Li, Rui Jiang, Yehao Lu, Tao Wu, and Xi Li. Cami2v: Camera-controlled image-to-video diffusion model, 2024.
  • Zhou et al. [2018] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images, 2018.
  • Çiçek et al. [2016] Özgün Çiçek, Ahmed Abdulkadir, Soeren S. Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: Learning dense volumetric segmentation from sparse annotation, 2016.
Supplementary Material

7 Additional Qualitative Results

We provide additional qualitative results of our method in Fig. 7.

[Figure 7 panels: Reference, Context, GT, Ours, GT, Ours, GT, Ours]
Figure 7: Additional qualitative results. Zoom in for more details.

8 Camera Evaluation

To ensure comparability with the evaluation paradigm proposed by CamI2V, we follow their configuration of the GLOMAP pipeline. The adjusted configuration parameters are listed in Tab. 4. The remaining parameters are set to the default parameters as provided by GLOMAP.

| Stage | Parameter | Value |
|---|---|---|
| Feature Extraction | ImageReader.single_camera | 1 |
| Feature Extraction | ImageReader.camera_model | SIMPLE_PINHOLE |
| Feature Extraction | ImageReader.camera_params | {f},{cx},{cy} |
| Feature Extraction | SiftExtraction.estimate_affine_shape | 1 |
| Feature Extraction | SiftExtraction.domain_size_pooling | 1 |
| Sequential Matcher | SiftMatching.guided_matching | 1 |
| Sequential Matcher | SiftMatching.max_num_matches | 65536 |
| Mapper | RelPoseEstimation.max_epipolar_error | 4 |
| Mapper | BundleAdjustment.optimize_intrinsics | 0 |

Table 4: Changed parameters of the GLOMAP pipeline used in our evaluation.
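The parameters in Tab. 4 could be passed to the pipeline roughly as sketched below. Only the parameter names and values are taken from the table. The executable names (`colmap`, `glomap`), the subcommands and the path flags are assumptions based on common COLMAP and GLOMAP command-line usage and may differ between versions; the helper name `glomap_commands` is hypothetical. The function only assembles argument lists and does not execute anything.

```python
def glomap_commands(database, images, output, f, cx, cy):
    """Build the three command lines (as argument lists) for the Tab. 4 settings."""
    feature = [
        "colmap", "feature_extractor",             # assumed executable/subcommand
        "--database_path", database,               # assumed path flags
        "--image_path", images,
        "--ImageReader.single_camera", "1",
        "--ImageReader.camera_model", "SIMPLE_PINHOLE",
        "--ImageReader.camera_params", f"{f},{cx},{cy}",
        "--SiftExtraction.estimate_affine_shape", "1",
        "--SiftExtraction.domain_size_pooling", "1",
    ]
    matcher = [
        "colmap", "sequential_matcher",            # assumed executable/subcommand
        "--database_path", database,
        "--SiftMatching.guided_matching", "1",
        "--SiftMatching.max_num_matches", "65536",
    ]
    mapper = [
        "glomap", "mapper",                        # assumed executable/subcommand
        "--database_path", database,
        "--image_path", images,
        "--output_path", output,
        "--RelPoseEstimation.max_epipolar_error", "4",
        "--BundleAdjustment.optimize_intrinsics", "0",
    ]
    return [feature, matcher, mapper]
```

Each returned list could then be handed to a process runner such as `subprocess.run`, once the flags have been checked against the installed tool versions.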