Online Clustered Codebook
ICCV 2023

Visual Geometry Group - University of Oxford

Abstract

Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is increasingly used in representation learning. However, optimising the codevectors in existing VQ-VAE models is not entirely trivial. A common problem is codebook collapse, where only a small subset of codevectors receives gradients useful for their optimisation, while the majority simply "die off" and are never updated or used. This limits the effectiveness of VQ for learning larger codebooks in complex computer vision tasks that require high-capacity representations. In this paper, we present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE). Our approach selects encoded features as anchors to update the "dead" codevectors, while optimising the codevectors that are alive via the original loss. This strategy brings unused codevectors closer in distribution to the encoded features, increasing their likelihood of being chosen and optimised. We extensively validate the generalisation capability of our quantiser on various datasets, tasks (e.g. reconstruction and generation), and architectures (e.g. VQ-VAE, VQGAN, LDM). CVQ-VAE can be easily integrated into existing models with just a few lines of code.
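As a rough illustration of this idea, below is a minimal PyTorch-style sketch of a quantiser that tracks code usage with a running average and reinitialises rarely used codevectors from the current batch of encoded features. The class name, hyperparameters, and the exact reinitialisation rule are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CVQQuantizer(nn.Module):
    # Illustrative sketch of dead-codevector reinitialisation; all names and
    # default values here are assumptions, not the authors' code.
    def __init__(self, num_codes=512, dim=64, decay=0.99, reinit_threshold=0.1):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(num_codes, dim))
        # Running average of how often each code is selected (a buffer,
        # not trained by gradient descent).
        self.register_buffer("avg_usage", torch.ones(num_codes))
        self.decay = decay
        self.reinit_threshold = reinit_threshold

    def forward(self, z):  # z: (batch, dim) encoded features
        # Nearest-neighbour assignment, as in a standard VQ-VAE.
        dists = torch.cdist(z, self.codebook)   # (batch, num_codes)
        idx = dists.argmin(dim=1)               # (batch,)
        z_q = self.codebook[idx]

        if self.training:
            # Update the running average of per-code usage counts.
            counts = torch.bincount(idx, minlength=self.codebook.shape[0]).float()
            self.avg_usage.mul_(self.decay).add_(counts, alpha=1 - self.decay)
            # Reinitialise rarely used ("dead") codes with encoded features
            # sampled from the batch, pulling them towards the feature
            # distribution so they can be selected and optimised later.
            dead = self.avg_usage < self.reinit_threshold
            if dead.any():
                anchors = z[torch.randint(z.shape[0], (int(dead.sum().item()),))]
                with torch.no_grad():
                    self.codebook.data[dead] = anchors.detach()
                self.avg_usage[dead] = self.reinit_threshold

        # Straight-through estimator and commitment loss, as in VQ-VAE.
        commit_loss = F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, commit_loss

Hard replacement is used here only for brevity; smoother running-average updates towards the anchor features, as the next section's title suggests, are a natural refinement of the same idea.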

Running Average Updates
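For reference, the generic running-average (EMA) codebook update from the original VQ-VAE line of work, on which such updates build, can be written as follows; this is the standard formulation, not necessarily the exact variant used here. Let n_i^{(t)} be the number of encoder outputs z_{i,j}^{(t)} assigned to codevector e_i at step t, and let γ be a decay factor:

N_i^{(t)} = \gamma\, N_i^{(t-1)} + (1-\gamma)\, n_i^{(t)}, \qquad
m_i^{(t)} = \gamma\, m_i^{(t-1)} + (1-\gamma) \sum_{j=1}^{n_i^{(t)}} z_{i,j}^{(t)}, \qquad
e_i^{(t)} = \frac{m_i^{(t)}}{N_i^{(t)}}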

More Results

Results for Data Compression (Stage-1).

Reconstructions from different models. The two models are trained under the same settings, except for the quantiser. Compared with the state-of-the-art VQGAN baseline, the proposed model significantly improves reconstruction quality (highlighted in red boxes) at the same compression ratio.

Ablation results for Data Compression (Stage-1).

Results for Unconditional Image Generation (Stage-2).

256×256 image samples generated using the proposed quantiser, with models trained on LSUN Bedroom (top) and LSUN Church (bottom).

Results for Class-conditional Image Generation (Stage-2).

Generated 256×256 images using our quantiser for class-conditional generation on ImageNet.

Citation

Acknowledgements

The website template was borrowed from Mip-NeRF.