Bringing Objects to Life: 4D generation from 3D objects


¹Bar-Ilan University  ²NVIDIA

* Equal Contribution.

[Teaser figure: input 3D meshes with the prompts "An elephant is shaking its trunk" and "A plant blooming", alongside the resulting 4D animations.]

Our method, 3to4D, takes a passive 3D object and a textual prompt describing a desired action. It then adds dynamics to the object based on the prompt to create a 4D animation, essentially a video viewable from any perspective. On the right, we display four 3D frames from the generated 4D animation.

Abstract


Recent advancements in generative models have enabled the creation of dynamic 4D content (3D objects in motion) based on text prompts, which holds potential for applications in virtual worlds, media, and gaming. However, existing methods provide limited control over the appearance of generated content. In this work, we introduce a method for animating user-provided 3D objects by conditioning on textual prompts to guide 4D generation, enabling custom animations while maintaining the original object's identity. We first convert a 3D mesh into a static 4D Neural Radiance Field (NeRF) that preserves the object’s visual attributes. Then, we animate the object using an Image-to-Video diffusion model driven by text. To improve motion realism, we introduce an incremental viewpoint selection protocol for sampling perspectives to promote lifelike movement and a masked Score Distillation Sampling (SDS) loss, which leverages attention maps to focus optimization on relevant regions. We evaluate our model on temporal coherence, prompt adherence, and visual fidelity, and find that our method outperforms baselines based on other approaches, achieving up to threefold improvements in LPIPS scores, and effectively balancing visual quality with dynamic content.


Overview


[Overview figure: the static input Mario mesh and generated 4D results for Mario running, jumping, waving, and walking.]

Left: the input object. Right: the resulting 4D object, viewed from azimuth -60° → 60° and progressing over time.



Instead of generating a 4D dynamic object using text control only, one may want to animate an existing 3D object, like your favorite 3D toy or character. Conditioning 4D generation on 3D assets offers several advantages: it enhances control, leverages existing 3D resources efficiently, and accelerates 4D generation by using the 3D asset as a strong initialization. Despite the wide availability of high-quality 3D assets, current methods have not yet used them to guide 4D generation.

We introduce a novel method for generating 4D scenes from user-provided 3D representations, called 3to4D, taking a simple approach that incorporates textual descriptions to govern the animation of the 3D object. First, we train a "static" 4D NeRF based on the input 3D mesh, effectively capturing the object's appearance from multiple views, replicated across time. Then, our method modifies the 4D object using an image-to-video diffusion model, conditioning its first frame on renderings of the input object.
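
To make this first stage concrete, here is a minimal PyTorch-style sketch of fitting the static 4D NeRF. The nerf_4d object and its render(camera, t) interface, the render_mesh and sample_camera callables, and all hyper-parameters are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn.functional as F

    def fit_static_4d_nerf(nerf_4d, render_mesh, sample_camera,
                           num_steps=3000, lr=1e-3):
        """Stage 1 (sketch): make the 4D NeRF reproduce the input mesh at every time.

        Assumed interfaces (hypothetical, for illustration only):
          nerf_4d.render(camera, t) -> H x W x 4 tensor (RGB + opacity), differentiable
          render_mesh(camera)       -> H x W x 4 ground-truth render of the input mesh
          sample_camera()           -> a random camera pose around the object
        """
        optimizer = torch.optim.Adam(nerf_4d.parameters(), lr=lr)
        for step in range(num_steps):
            camera = sample_camera()            # random viewpoint
            target = render_mesh(camera)        # appearance to preserve
            t = torch.rand(()).item()           # any time in [0, 1]: the object
            pred = nerf_4d.render(camera, t)    # must look the same at every
            loss = F.mse_loss(pred, target)     # time step
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        return nerf_4d

The key point is that the reconstruction target is time-independent, so the optimized field is a valid 4D representation that simply does not move yet.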

Unfortunately, we find that applying this approach naively is insufficient, because it dramatically reduces the level of dynamic motion. To encourage the model to generate more dynamic movements, we propose two key improvements. First, we introduce a camera viewpoint selector that incrementally samples different viewpoints around the object during optimization. This gradual-widening sampling approach enhances the generation process, resulting in more pronounced movement. Second, we introduce a masked variant of the SDS loss that uses attention maps obtained from the image-to-video model. The masked SDS focuses the optimization on object-relevant regions of the latent space, improving how the elements of the object itself are refined.
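
The two components can be sketched as follows. The specific widening schedule, the azimuth and elevation ranges, and the way the attention mask multiplies the SDS gradient are assumptions made for illustration; the paper's exact formulation may differ.

    import random

    import torch

    def sample_incremental_viewpoint(step, total_steps, max_azimuth_deg=180.0,
                                     warmup_frac=0.5):
        """Incremental viewpoint selection (sketch): start near the frontal view
        and gradually widen the sampled azimuth range as optimization progresses."""
        widen = min(1.0, step / (warmup_frac * total_steps))  # 0 -> 1 over warm-up
        half_range = widen * max_azimuth_deg
        azimuth = random.uniform(-half_range, half_range)     # degrees
        elevation = random.uniform(0.0, 30.0)                 # mild elevation jitter
        return azimuth, elevation

    def masked_sds_grad(noise_pred, noise, attn_mask, weight=1.0):
        """Masked SDS gradient (sketch): the standard SDS residual, kept only at
        latent pixels that an attention map marks as object-relevant."""
        grad = weight * (noise_pred - noise)   # standard SDS gradient direction
        return grad * attn_mask                # zero out background latents

Both helpers are reused in the per-step sketch in the Pipeline section below.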

Pipeline


Workflow of our 3to4D, designed to optimize a 4D radiance field using a neural representation that captures both static and dynamic elements. First, a 4D NeRF is trained to represent the passive object (plant, left), reproducing the same input appearance at each time step. Then, we introduce dynamics to the 4D NeRF by distilling the prior from a pre-trained image-to-video model. At each SDS step, we select a viewpoint and render both the input object and the 4D NeRF from the same selected viewpoint. These renders, along with the textual prompt, are fed into the image-to-video model, and the SDS loss is calculated to guide the generation of motion while preserving the object's identity. The attention-masked SDS focuses learning on the relevant parts of the object, improving identity preservation.
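
Putting these pieces together, a single optimization step of the dynamics stage can be sketched as below, reusing the two helpers from the Overview section. The i2v_model interface (encode, add_noise, predict_noise, attention_mask) is a hypothetical wrapper around a pre-trained image-to-video diffusion model rather than the API of any specific library, and the diffusion-timestep range and SDS weighting are placeholder choices.

    import torch

    def sds_step(nerf_4d, render_mesh, i2v_model, prompt, optimizer,
                 step, total_steps, num_frames=16):
        """One SDS step of the dynamics stage (illustrative sketch only)."""
        # 1. Pick a viewpoint with the incremental schedule sketched earlier.
        camera = sample_incremental_viewpoint(step, total_steps)

        # 2. Render the conditioning image (input object) and the 4D NeRF clip
        #    from the same viewpoint.
        cond_image = render_mesh(camera)
        times = torch.linspace(0.0, 1.0, num_frames)
        video = torch.stack([nerf_4d.render(camera, t.item()) for t in times])

        # 3. Score distillation through the image-to-video prior.
        latents = i2v_model.encode(video)              # keep gradients to the NeRF
        with torch.no_grad():
            noise = torch.randn_like(latents)
            t_diff = torch.randint(20, 980, (1,))      # placeholder timestep range
            noisy = i2v_model.add_noise(latents, noise, t_diff)
            noise_pred = i2v_model.predict_noise(noisy, t_diff, cond_image, prompt)
            mask = i2v_model.attention_mask(cond_image)  # object-relevant regions

        # 4. Attention-masked SDS: a surrogate loss whose gradient with respect
        #    to the latents equals the masked SDS gradient.
        grad = masked_sds_grad(noise_pred, noise, mask)
        loss = (latents * grad).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.detach()

Here the optimizer updates the 4D NeRF's parameters, while the static appearance learned in the first stage serves as the initialization into which the prompt-driven motion is distilled.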

Results


For 3D objects, we used the Google Scanned Objects (GSO) dataset, a collection of high-quality 3D scans of everyday items.


[Results gallery: generated 4D objects for the prompts "The hulk is smashing", "A plant blooming", "A hand bell ringing", "An elephant is shaking its ears", "A walking heel shoe", "An elephant is trumpeting with its trunk", "Smoke comes out of the train", "The truck extends its arm to lift the cargo", "A turtle has its head inside its shell", and "A honey dipper drizzles honey".]

Examples are arranged in two columns, where each example has the following structure: on the left, we display the input passive 3D object; on the right, we present a video of the generated 4D object, viewed from azimuth -60° → 60° and progressing over time. The title of each example shows the input prompt.

Citation



    @article{rahamim2024bringingobjectslife4d,
        title={Bringing Objects to Life: 4D generation from 3D objects},
        author={Ohad Rahamim and Ori Malca and Dvir Samuel and Gal Chechik},
        year={2024},
        eprint={2412.20422},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2412.20422},
    }