SierpinskiCam: Camera-Controlled Video Retaking

Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video — dubbed video retaking — is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing.

We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism (NegRoPE) that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation.

Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios.

Method

SierpinskiCam tackles two core conditioning questions in video retaking: how to inject a target camera trajectory (for precise viewpoint control) and how to inject the source video (for faithful appearance preservation).

Figure: SierpinskiCam pipeline. Given a source video and target camera trajectory, we (1) reconstruct geometry-based proxies via depth and point tracks, (2) fill unobserved regions with a Sierpinski-textured dome, and (3) inject the source video via NegRoPE for appearance grounding during diffusion.

Contribution 1

Sierpinski Textured Dome

When the target camera reveals regions outside the original observation, geometry-based guidance becomes sparse or ambiguous. We add a Sierpinski fractal texture to the surrounding dome so that newly visible regions still contain multi-scale, trackable visual cues. These cues make the target camera motion easier to infer and help the diffusion model maintain stable geometry under large viewpoint changes.

Contribution 2

NegRoPE: Negative Rotary Position Embedding

Source and target video tokens are concatenated into a shared transformer sequence. If the two streams share the same positional indices, however, the model tends to attend by index rather than by semantics.

NegRoPE assigns target tokens positive spatial indices (+n) and source tokens negative indices (−n). Because the RoPE rotation at −n is the complex conjugate of the rotation at +n, the two streams are separated without any architectural modification or per-video fine-tuning.
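The conjugate relationship can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a simplified 1D position index (the page refers to spatial indices), a standard RoPE frequency schedule with base 10000, and hypothetical helper names `rope_phases`, `apply_rope`, and `scores`.

```python
import numpy as np

def rope_phases(pos, dim=8, base=10000.0):
    """Complex RoPE factors exp(i * pos * theta_k) for dim/2 frequency pairs."""
    theta = base ** (-np.arange(0, dim, 2) / dim)
    return np.exp(1j * np.outer(pos, theta))

def apply_rope(x, pos):
    """Rotate real features x of shape (tokens, dim); channels (2k, 2k+1) form one complex number."""
    xc = x[:, 0::2] + 1j * x[:, 1::2]
    return xc * rope_phases(pos, x.shape[1])

def scores(q, q_pos, k, k_pos):
    """Attention logits between rotated queries and keys (real dot product)."""
    qr, kr = apply_rope(q, q_pos), apply_rope(k, k_pos)
    return np.real(qr @ np.conj(kr).T)

n = np.arange(1, 5)
target_pos = n    # target tokens get +n
source_pos = -n   # source tokens get -n (assumed 1D simplification)

# The rotation at -n is the complex conjugate of the rotation at +n.
conj_ok = np.allclose(rope_phases(source_pos), np.conj(rope_phases(target_pos)))

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))  # target queries
k = rng.normal(size=(4, 8))  # source keys

# Shared indices: query n and key n have relative offset 0, so the rotation
# cancels and the logit equals the plain, position-free dot product.
shared = np.diag(scores(q, target_pos, k, target_pos))
# Negated source indices: the relative offset becomes 2n, so the same
# query/key pair is no longer treated as co-located.
negated = np.diag(scores(q, target_pos, k, source_pos))
```

In this sketch, `shared` matches the unrotated dot products exactly, which illustrates why shared indices make same-index source and target tokens look co-located, while `negated` does not.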

Why Sierpinski? — Multi-Scale Trackability

Without dome texture (sparse in background)

With Sierpinski dome (dense motion cues everywhere)

The Sierpinski fractal provides structural details at both near and far views thanks to its self-similar, multi-scale nature — unlike checkerboard or single-scale patterns that degrade at large viewpoint changes.
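The self-similarity claim can be illustrated with a small sketch. The page does not say which Sierpinski variant or resolution is used, so this example assumes a binary Sierpinski-triangle pattern built from a bitwise rule, and `sierpinski_texture` is a hypothetical helper name.

```python
import numpy as np

def sierpinski_texture(size: int = 256) -> np.ndarray:
    """Return a (size, size) binary Sierpinski-triangle mask; size is assumed to be a power of two."""
    ys, xs = np.mgrid[0:size, 0:size]
    # A pixel is "on" when its x and y coordinates share no set bits;
    # this bitwise rule draws a Sierpinski triangle.
    return ((xs & ys) == 0).astype(np.uint8)

tex = sierpinski_texture(256)

# Viewing the texture from "farther away" (keeping every second pixel)
# reproduces the same pattern at half resolution.
far_view = tex[::2, ::2]
same_at_half_scale = np.array_equal(far_view, sierpinski_texture(128))

# Viewing it "closer" (cropping the top-left quadrant) does too.
near_view = tex[:128, :128]
same_when_cropped = np.array_equal(near_view, sierpinski_texture(128))
```

A checkerboard with a fixed cell size would not pass the subsampling check: keeping every second pixel changes its cell size relative to the image, and at a coarse enough sampling it collapses to a flat region with no trackable structure.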

Qualitative Comparison

We compare against implicit methods (ReCamMaster, ReDirector) and an explicit method (TrajectoryCrafter) on DAVIS videos with challenging camera trajectories. Implicit methods tend to keep objects anchored in view even when they should leave the camera frustum, while TrajectoryCrafter fails in regions where geometric guidance is sparse. SierpinskiCam faithfully follows the target trajectory while preserving scene dynamics.
