MoVQ

MoVQ: Modulating Quantized Vectors for
High-Fidelity Image Generation
NeurIPS (Spotlight) 2022

Chuanxia Zheng, Monash University
Long Tung Vuong, VinAI
Jianfei Cai, Monash University
Dinh Phung, Monash University

Abstract

Although two-stage Vector Quantized (VQ) generative models can synthesize high-fidelity, high-resolution images, their quantization operator encodes similar patches within an image into the same index, which produces repeated artifacts in similar adjacent regions with existing decoder architectures. To address this issue, we propose incorporating spatially conditional normalization to modulate the quantized vectors, inserting spatially variant information into the embedded index maps and encouraging the decoder to generate more photorealistic images. Moreover, we use multichannel quantization to increase the recombination capability of the discrete codes without increasing the cost of the model and codebook. Additionally, to generate discrete tokens in the second stage, we adopt a Masked Generative Image Transformer (MaskGIT) to learn an underlying prior distribution in the compressed latent space, which is much faster than a conventional autoregressive model. Experiments on two benchmark datasets demonstrate that our proposed modulated VQGAN greatly improves reconstructed image quality and provides high-fidelity image generation.
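To make the multichannel quantization idea more concrete, here is a minimal sketch in plain Python. It assumes that each encoder vector is split along the channel dimension into equal chunks and that every chunk is matched against one shared codebook by squared L2 distance. The function names `nearest_code` and `multichannel_quantize` and the toy codebook are illustrative assumptions, not the authors' implementation.

```python
def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def multichannel_quantize(vec, codebook, num_chunks):
    """Quantize one latent vector as num_chunks indices into a shared codebook.

    Assumption (hypothetical layout): vec is split into num_chunks equal
    chunks, and each chunk has the same dimension as a codebook entry.
    """
    if len(vec) % num_chunks != 0:
        raise ValueError("vector length must be divisible by num_chunks")
    chunk_dim = len(vec) // num_chunks
    if any(len(code) != chunk_dim for code in codebook):
        raise ValueError("codebook entries must match the chunk dimension")

    indices, quantized = [], []
    for k in range(num_chunks):
        chunk = vec[k * chunk_dim:(k + 1) * chunk_dim]
        idx = nearest_code(chunk, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])  # concatenate the selected codes
    return indices, quantized


# Toy shared codebook with 3 entries of dimension 2 (illustrative values).
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latent = [0.9, 1.2, -0.8, 0.7, 0.1, -0.1, 0.2, 0.0]
indices, quantized = multichannel_quantize(latent, codebook, num_chunks=4)
```

With a codebook of K entries and c chunks per position, each position can take up to K^c distinct combinations while the codebook itself stays at K entries, which is one way to read "recombination capability" in the abstract.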

Model Architecture

Left: The quantizer architecture of our proposed MoVQ. We incorporate a spatially conditional normalization layer into the decoder, where two convolution layers predict the modulation parameters γ and β point-wise to modulate the learned discrete structure information. Right: Masked image generation. Here, a bidirectional transformer is applied to estimate the underlying prior distribution on the discrete representation with multiple channels.
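The modulation described in the caption can be sketched in plain Python as follows. This is a sketch under stated assumptions rather than the released model: per-channel normalization stands in for whatever normalization the decoder actually uses, the two convolutions are taken to be 1×1 (point-wise) linear maps, and the quantized map `z_q` is assumed to already have the same spatial size as the decoder feature `feat`. All function and parameter names are hypothetical.

```python
import math


def normalize_channels(feat, eps=1e-5):
    """Normalize each channel of feat[c][h][w] to zero mean, unit variance."""
    out = []
    for channel in feat:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        var = sum((v - mean) ** 2 for v in values) / len(values)
        scale = 1.0 / math.sqrt(var + eps)
        out.append([[(v - mean) * scale for v in row] for row in channel])
    return out


def pointwise_conv(z_q, weight, bias):
    """1x1 convolution: a per-pixel linear map from z_q[c_in][h][w]."""
    height, width = len(z_q[0]), len(z_q[0][0])
    return [
        [
            [
                bias[o] + sum(weight[o][i] * z_q[i][y][x] for i in range(len(z_q)))
                for x in range(width)
            ]
            for y in range(height)
        ]
        for o in range(len(weight))
    ]


def spatially_conditional_norm(feat, z_q, w_gamma, b_gamma, w_beta, b_beta):
    """Return gamma(z_q) * normalize(feat) + beta(z_q), evaluated per pixel."""
    normed = normalize_channels(feat)
    gamma = pointwise_conv(z_q, w_gamma, b_gamma)  # per-pixel scale
    beta = pointwise_conv(z_q, w_beta, b_beta)     # per-pixel shift
    return [
        [
            [g * n + b for g, n, b in zip(g_row, n_row, b_row)]
            for g_row, n_row, b_row in zip(g_ch, n_ch, b_ch)
        ]
        for g_ch, n_ch, b_ch in zip(gamma, normed, beta)
    ]


# One-channel 2x2 toy example (illustrative values only).
feat = [[[1.0, 1.0], [1.0, 1.0]]]   # constant decoder activation
z_q = [[[0.0, 1.0], [2.0, 3.0]]]    # quantized map that varies over space
out = spatially_conditional_norm(feat, z_q, [[1.0]], [0.0], [[2.0]], [1.0])
```

In this toy case the decoder activation is constant, so the output varies across positions only because γ and β are predicted from `z_q` at each pixel. That is the sense in which the layer inserts spatially variant information.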

More Results

Results for Data Compression (Stage-1).

Reconstructions from different models. The two numbers denote the latent size and the learned codebook size, respectively. Our model dramatically improves the image quality in the first stage.

Results for Unconditional Image Generation (Stage-2).

256×256 image samples generated by the proposed MoVQ, with the model trained on FFHQ.

Results for Class-conditional Image Generation (Stage-2).

256×256 images generated by our MoVQ for class-conditional generation on ImageNet.

Citation

@InProceedings{zheng2022movq,
  author    = {Zheng, Chuanxia and Vuong, Long Tung and Cai, Jianfei and Phung, Dinh},
  title     = {MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation},
  booktitle = {Thirty-sixth Conference on Neural Information Processing Systems},
  year      = {2022},
}

Acknowledgements

The website template was borrowed from Mip-NeRF.

Code (reproduced by Kandinsky2)
