Infusion: Internal Diffusion for Video Inpainting

1Télécom Paris, 2MAP5, 3Sorbonne Université

We propose a diffusion-based video inpainting model trained on a single video, relying on the video's internal content and self-similarity to fill in the masked region. Our approach leverages the diffusion framework, which is particularly well suited to dynamic textures, without requiring an oversized network.


Video inpainting is the task of filling in a desired region of a video in a visually convincing manner. It is a very challenging task due to the high dimensionality of the signal and the temporal consistency required for obtaining convincing results. Recently, diffusion models have shown impressive results in modeling complex data distributions, including images and videos. Diffusion models nonetheless remain very expensive to train and perform inference with, which strongly restricts their application to video. We show that, in the case of video inpainting, thanks to the highly self-similar nature of videos, the training of a diffusion model can be restricted to the video to inpaint and still produce very satisfying results. This leads us to adopt an internal learning approach, which also allows for a greatly reduced network size. We call our approach "Infusion": an internal learning algorithm for video inpainting through diffusion. Thanks to our frugal network, we are able to propose the first video inpainting approach based purely on diffusion. Other methods require supporting elements such as optical flow estimation, which limits their performance, for example in the case of dynamic textures. We also introduce a new method for efficient training and inference of diffusion models in the context of internal learning: we split the diffusion process into different learning intervals, which greatly simplifies the learning steps. We show qualitative and quantitative results, demonstrating that our method reaches state-of-the-art performance, in particular in the case of dynamic backgrounds and textures.
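To make the internal-learning idea concrete, the following is a minimal NumPy sketch of the standard DDPM forward process together with a masked training loss that supervises the denoiser only on known (unmasked) pixels of the single input video. The function names and the linear noise schedule are illustrative assumptions; the paper's actual network and training loop are not shown here.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=0.02):
    """Standard DDPM linear noise schedule (illustrative choice)."""
    return np.linspace(beta_start, beta_end, T)

def q_sample(x0, t, alphas_cumprod, noise):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * noise

def masked_denoising_loss(pred_noise, true_noise, mask):
    """Internal-learning loss: mask == 1 marks the hole to inpaint, so the
    noise-prediction error is averaged over known (mask == 0) pixels only."""
    known = 1.0 - mask
    return np.sum(known * (pred_noise - true_noise) ** 2) / max(known.sum(), 1)
```

At inference, the same denoiser trained on the known content is used to sample the missing region, exploiting the video's self-similarity.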

Interval training

To improve results while keeping the network relatively small, we propose interval training: a single lightweight network is trained on only a subset of timesteps at a time. We first train the model on a given interval of timesteps. Once training on this interval has finished, we use the model to infer the samples at the start of the next interval, and the model is then trained on that next interval. This is repeated until we reach timestep t = 0.
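The schedule described above can be sketched as follows. This is a simplified Python illustration, not the paper's implementation: `train_fn` and `infer_fn` are hypothetical callbacks standing in for the actual per-interval training and denoising steps.

```python
import numpy as np

def make_intervals(T=1000, n_intervals=10):
    """Split timesteps [T-1 .. 0] into contiguous intervals, ordered from
    most noisy to least noisy (the order in which they are trained)."""
    edges = np.linspace(T, 0, n_intervals + 1).astype(int)
    return [(int(hi) - 1, int(lo)) for hi, lo in zip(edges[:-1], edges[1:])]

def interval_training(intervals, train_fn, infer_fn, x_T):
    """Train one lightweight model per interval, in sequence.
    train_fn(t_hi, t_lo): hypothetical - fit the model on timesteps [t_lo, t_hi].
    infer_fn(x, t_hi, t_lo): hypothetical - denoise x from t_hi down to t_lo,
    producing the starting point of the next interval."""
    x = x_T
    for t_hi, t_lo in intervals:
        train_fn(t_hi, t_lo)         # train only on this timestep range
        x = infer_fn(x, t_hi, t_lo)  # infer the beginning of the next interval
    return x  # sample at t = 0
```

Because each training phase only has to model a narrow range of noise levels, the learning problem at each step is much simpler than covering all timesteps at once.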

Figure: Interval training splits the training of the diffusion model into multiple intervals.
Figure: Inpainting results, baseline vs. interval training on 2D textures.



Nicolas Cherel, Andrés Almansa, Yann Gousseau, Alasdair Newson. Infusion: Internal Diffusion for Video Inpainting.