DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models


Project Lead & Corresponding Author

Abstract

Discrete diffusion models have achieved success in tasks like image generation and masked language modeling but face limitations in controlled content editing. We introduce DICE (Discrete Inversion for Controllable Editing), the first approach to enable precise inversion for discrete diffusion models, including multinomial diffusion and masked generative models. By recording noise sequences and masking patterns during the reverse diffusion process, DICE enables accurate reconstruction and flexible editing of discrete data without the need for predefined masks or attention manipulation. We demonstrate the effectiveness of DICE across both image and text domains, evaluating it on models such as VQ-Diffusion, Paella, and RoBERTa. Our results show that DICE preserves high data fidelity while enhancing editing capabilities, offering new opportunities for fine-grained content manipulation in discrete spaces. Code is available at [link].

Illustration of the Limitation of Masked Inpainting

Here, we want to change the cat to a dog. Inpainting with masked generation inadvertently changes the orientation of the head, giving a less favourable result. With our discrete inversion, we can edit the image while preserving the other properties of the object being edited. This is achieved by injecting information from the input image into the logit space. The dotted red box indicates the mask.

Input Image, Inpainting with Mask and Our Inversion-based method without mask

ODE-Based and Non-ODE-Based Editing and Reconstruction Paradigms

Panels (a, c) show ODE-based editing and reconstruction. While this paradigm gives accurate editing and reconstruction, it depends heavily on an underlying ODE trajectory, which is not available in discrete diffusion. In contrast, non-ODE editing samples a trajectory by directly adding noise to x0 and records the difference between the predicted xt-1 and the sampled xt-1, as indicated by the red arrow. In this way, we can reconstruct or edit the image without requiring an underlying ODE.
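The non-ODE paradigm can be illustrated with a minimal NumPy sketch. It is not the released DICE code: the denoiser `toy_denoiser_logprobs`, the uniform-resampling corruption in `corrupt`, and the `eps` clamp on the one-hot target are all assumptions made for illustration. The sketch noises x0 directly, records at each step the gap between the sampled xt-1 (in logit space) and the model's prediction, and then replays those gaps to recover x0 exactly.

```python
import numpy as np

def toy_denoiser_logprobs(x_t, t, vocab_size):
    """Hypothetical stand-in for a trained denoiser: log-probs over x_{t-1}."""
    table = np.random.default_rng(1000 + t).normal(size=(vocab_size, vocab_size))
    logits = table[x_t]  # shape (seq_len, vocab_size), deterministic in (x_t, t)
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

def corrupt(x0, t, T, vocab_size, rng):
    """Add noise directly to x0 (assumed scheme: resample tokens with prob t/T)."""
    replace = rng.random(x0.shape) < t / T
    return np.where(replace, rng.integers(0, vocab_size, size=x0.shape), x0)

def invert(x0, T, vocab_size, rng, eps=1e-8):
    """Sample a noisy trajectory from x0 and record per-step logit residuals."""
    xs = [x0] + [corrupt(x0, t, T, vocab_size, rng) for t in range(1, T + 1)]
    residuals = {}
    for t in range(T, 0, -1):
        log_pi = toy_denoiser_logprobs(xs[t], t, vocab_size)  # predicted x_{t-1}
        target = np.log(np.eye(vocab_size)[xs[t - 1]] + eps)  # sampled x_{t-1}
        residuals[t] = target - log_pi                        # recorded difference
    return xs[T], residuals

def reconstruct(x_T, residuals, T, vocab_size):
    """Replay the recorded residuals on top of the model's predictions."""
    x = x_T
    for t in range(T, 0, -1):
        log_pi = toy_denoiser_logprobs(x, t, vocab_size)
        x = np.argmax(log_pi + residuals[t], axis=-1)
    return x

rng = np.random.default_rng(0)
vocab_size, seq_len, T = 16, 12, 5
x0 = rng.integers(0, vocab_size, size=seq_len)
x_T, residuals = invert(x0, T, vocab_size, rng)
x_rec = reconstruct(x_T, residuals, T, vocab_size)
```

Because each residual is defined relative to the model's own prediction, replaying it with an unchanged model returns the recorded trajectory token for token. Editing would instead evaluate the denoiser under a different condition (for example a target prompt) while reusing the same residuals.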

Image Reconstruction Results

Multi-Method Visualization comparison
Reconstruction and editing results with Discrete Inversion and Paella.

Image Editing Results

Editing results for our method using Paella and VQ-Diffusion are presented, along with their corresponding prompts. The results demonstrate that our method can effectively modify the input image according to the target prompt while preserving the image structure. Editing with a masked generative model (Paella) is more stable and easier than with a multinomial diffusion model (VQ-Diffusion).

Comparison with Masked Inpainting

Comparison of editing results between our method and masked inpainting. Our method can better preserve the original image structure and generate more realistic results.

Image Editing with Diversity

Due to the stochastic nature of our method, we can generate diverse outputs. The first three rows illustrate variations in both inversion masks and injected Gumbel noise (λ1 = 0.7, λ2 = 0.3). The last two rows demonstrate variations using only inversion masks (λ1 = 1, λ2 = 0).
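The page does not spell out how λ1 and λ2 enter the sampler. The sketch below assumes, consistent with the caption, that λ1 scales the residual recorded during inversion and λ2 scales freshly drawn Gumbel noise added to the model's log-probabilities. The function `mix_and_sample` and the random stand-ins for the model outputs are hypothetical.

```python
import numpy as np

def mix_and_sample(log_pi, residual, lam1, lam2, rng):
    """Assumed mixing rule: argmax(log_pi + lam1 * residual + lam2 * Gumbel noise)."""
    gumbel = rng.gumbel(size=log_pi.shape)
    return np.argmax(log_pi + lam1 * residual + lam2 * gumbel, axis=-1)

def log_softmax(logits):
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

vocab_size, seq_len = 6, 8
setup = np.random.default_rng(0)

# Hypothetical model outputs under the source and target prompts.
log_pi_source = log_softmax(setup.normal(size=(seq_len, vocab_size)))
log_pi_target = log_softmax(setup.normal(size=(seq_len, vocab_size)))

# Residual recorded during inversion for some sampled source tokens.
source_tokens = setup.integers(0, vocab_size, size=seq_len)
residual = np.log(np.eye(vocab_size)[source_tokens] + 1e-8) - log_pi_source

# lam1 = 1, lam2 = 0: no fresh noise, so the result does not depend on the seed.
fixed_a = mix_and_sample(log_pi_target, residual, 1.0, 0.0, np.random.default_rng(1))
fixed_b = mix_and_sample(log_pi_target, residual, 1.0, 0.0, np.random.default_rng(2))

# lam1 = 0.7, lam2 = 0.3: fresh Gumbel noise lets different seeds give different tokens.
varied = [
    mix_and_sample(log_pi_target, residual, 0.7, 0.3, np.random.default_rng(seed))
    for seed in range(3)
]
```

Under this assumption, with λ2 = 0 the only remaining source of diversity is the inversion itself (a different sampled trajectory or mask yields a different residual), which matches the caption's description of the last two rows.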

Text Editing Results

Each example lists the "Negative Prompt" column (inversion prompt and inverted sentence) followed by the "Our Edited Results" column (editing prompt and edited result).

Example 1
  Inversion prompt: Negative Sentiment: This book is definitely interesting.
  Inverted sentence: I can't wait to finish it; it's so predictable.
  Editing prompt: Positive Sentiment: This book is definitely interesting.
  Edited result: I can't wait to see it; it sounds so beautiful.

Example 2
  Inversion prompt: Negative Sentiment: The new office space is fantastic.
  Inverted sentence: It's cramped and lacks proper facilities.
  Editing prompt: Positive Sentiment: The new office space is fantastic.
  Edited result: It's spacious and has great facilities.

Example 3
  Inversion prompt: Negative Sentiment: Despite her efforts.
  Inverted sentence: The event was a complete disaster.
  Editing prompt: Positive Sentiment: Thanks to her efforts.
  Edited result: This event was a fantastic comedy game.

Example 4
  Inversion prompt: Negative Sentiment: Regarding the lecture.
  Inverted sentence: It was dull and confusing.
  Editing prompt: Positive Sentiment: Regarding the lecture.
  Edited result: It was clear and surprising.

Example 5
  Inversion prompt: Negative Sentiment: Despite the initial problems.
  Inverted sentence: The project ended in failure.
  Editing prompt: Positive Sentiment: Despite the initial problems.
  Edited result: New project still in progress.

Example 6
  Inversion prompt: Negative Sentiment: Regarding the new app.
  Inverted sentence: It's complicated and not useful.
  Editing prompt: Positive Sentiment: Regarding the new app.
  Edited result: It's On and It's Epic.

Example 7
  Inversion prompt: Negative Sentiment: Reflecting on my environmental initiatives.
  Inverted sentence: It's challenging to maintain, and progress is slow.
  Editing prompt: Positive Sentiment: Reflecting on my environmental initiatives.
  Edited result: It's easy to understand, and progress is undeniable.

Table: Editing results of our method with RoBERTa. The prompts (black in the original table) are the ones used for inversion and editing in their respective columns. The inverted sentence (red in the original table) is the input being inverted, and the edited result (blue in the original table) is the output of our method.

BibTeX

@article{he2024dice,
  title={DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models},
  author={He, Xiaoxiao and Han, Ligong and Dao, Quan and Wen, Song and Bai, Minhao and Liu, Di and Zhang, Han and Min, Martin Renqiang and Juefei-Xu, Felix and Tan, Chaowei and Liu, Bo and Li, Kang and Li, Hongdong and Huang, Junzhou and Ahmed, Faez and Srivastava, Akash and Metaxas, Dimitris},
  journal={arXiv preprint arXiv:2410.08207},
  year={2024}
}