Multi-Modal Variational Autoencoders

Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models

Intro

The authors characterize successful learning of such models as the fulfilment of four criteria: i) implicit latent decomposition into shared and private subspaces, ii) coherent joint generation over all modalities, iii) coherent cross-generation across individual modalities, and iv) improved model learning for individual modalities through multi-modal integration.


When we take all modalities (visual, linguistic, and physical) into account, we get a richer understanding and representation of the underlying concept, for example a bird.


  • Latent Factorization - The latent space decomposes into subspaces that are shared across modalities and subspaces that are private to each modality.
  • Joint Generation - The model can generate all modalities jointly from the shared latent space; for example, a generated image and its generated text description should be semantically consistent.
  • Cross Generation - The model can generate data in one modality conditioned on another modality, e.g. generate an image from text.
  • Synergy - Observing multiple modalities should improve understanding; for example, observing both image and text should yield more specific generations than observing either one alone.


There has been prior work on learning joint multi-modal representations, for example:

Suzuki et al. (2017) introduced the joint multimodal VAE (JMVAE), which learns a shared representation with a joint encoder qΦ(z | x1, x2). To handle missing modalities at test time, two unimodal encoders qΦ(z | x1) and qΦ(z | x2) are trained to match qΦ(z | x1, x2) through a KL constraint between them.
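Concretely, the JMVAE-kl objective (sketched here from Suzuki et al.'s formulation, with α a weighting hyperparameter; not quoted from this post) augments the joint ELBO with those KL terms:

$$
\mathcal{L}_{\text{JMVAE-kl}}(x_1, x_2) = \mathrm{ELBO}(x_1, x_2) - \alpha \Big[ \mathrm{KL}\big(q_\Phi(z \mid x_1, x_2) \,\|\, q_\Phi(z \mid x_1)\big) + \mathrm{KL}\big(q_\Phi(z \mid x_1, x_2) \,\|\, q_\Phi(z \mid x_2)\big) \Big]
$$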

However, these approaches often focus only on one-way transformations, typically image-to-image, and require additional modelling components.

More recently, Wu and Goodman (2018) introduced a Product of Experts (PoE) over the unimodal posteriors, enabling cross-modal generation at test time without requiring additional inference networks or multi-stage training regimes.
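As a hedged illustration (my own minimal sketch, not Wu and Goodman's code; the function name and shapes are hypothetical): when each unimodal posterior is Gaussian, their product is again Gaussian, obtained by precision-weighted averaging of the experts.

```python
import numpy as np

def product_of_gaussian_experts(mus, logvars):
    """Combine unimodal Gaussian posteriors q(z|x_m) = N(mu_m, var_m)
    into a single Gaussian proportional to their product.

    mus, logvars: arrays of shape (M, D) for M experts over a D-dim latent.
    Returns the mean and log-variance of the product Gaussian.
    """
    precisions = np.exp(-np.asarray(logvars))               # 1 / var_m
    var = 1.0 / precisions.sum(axis=0)                      # combined variance
    mu = var * (precisions * np.asarray(mus)).sum(axis=0)   # precision-weighted mean
    return mu, np.log(var)

# Two equally confident experts: the combined posterior sits between them
# and is tighter than either expert alone.
mu, logvar = product_of_gaussian_experts(
    mus=[[0.0, 0.0], [1.0, 1.0]],
    logvars=[[0.0, 0.0], [0.0, 0.0]],
)
print(mu, np.exp(logvar))   # mean ~ [0.5, 0.5], variance ~ [0.5, 0.5]
```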

Some Mathematics

We denote modalities by m = 1, 2, …, M and the latent representation by z.

The multi-modal VAE has a generative model of the form:

$$
p_\Theta(z, x_{1:M}) = p(z) \prod_{m=1}^{M} p_{\theta_m}(x_m \mid z)
$$

The likelihoods pΘ(xm | z) are parametrized by Θ, the weights of deep neural networks (the decoders).

The objective of training VAEs is to maximise the marginal likelihood of the data pΘ(x1:M). However, computing the evidence is intractable as it requires knowledge of the true joint posterior pΘ(z | x1:M). To tackle this, we approximate the true unknown posterior by a variational posterior qΦ(z | x1:M), which now allows optimising an evidence lower bound (ELBO) through stochastic gradient descent (SGD), with ELBO defined as

$$
\mathrm{ELBO}(x_{1:M}) = \mathbb{E}_{q_\Phi(z \mid x_{1:M})}\!\left[ \log \frac{p_\Theta(z, x_{1:M})}{q_\Phi(z \mid x_{1:M})} \right]
$$
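As a minimal sketch (hypothetical encoder/decoder interfaces, not the authors' implementation), a single-sample Monte Carlo estimate of this ELBO with a Gaussian prior and posterior can be written as:

```python
import torch

def elbo_estimate(x, encoder, decoder, prior_std=1.0):
    """Single-sample Monte Carlo estimate of the ELBO for one modality.

    encoder(x) -> (mu, logvar) of q_phi(z | x)
    decoder(z) -> a torch.distributions object for p_theta(x | z)
    """
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)          # reparameterisation trick

    q = torch.distributions.Normal(mu, std)
    p = torch.distributions.Normal(torch.zeros_like(mu), prior_std)

    log_px_z = decoder(z).log_prob(x).sum(-1)     # reconstruction term
    log_pz = p.log_prob(z).sum(-1)
    log_qz_x = q.log_prob(z).sum(-1)

    return (log_px_z + log_pz - log_qz_x).mean()  # E_q[log p(x,z) - log q(z|x)]
```

In the multi-modal case, the reconstruction term sums the log-likelihoods over all M modalities, following the generative model above.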

They further use an importance-weighted (IWAE) estimator, which gives a tighter bound and induces a higher-entropy variational posterior.
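For reference, the generic K-sample importance-weighted bound (standard IWAE form, written in the notation above rather than quoted from the paper) replaces the single-sample ratio with an average over K samples:

$$
\mathcal{L}_K(x_{1:M}) = \mathbb{E}_{z^{(1)}, \dots, z^{(K)} \sim q_\Phi(z \mid x_{1:M})}\!\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\Theta(z^{(k)}, x_{1:M})}{q_\Phi(z^{(k)} \mid x_{1:M})} \right]
$$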

Here we face another issue: how do we obtain the joint posterior qΦ(z | x1:M)? One basic approach is a single encoder that takes all modalities as input, but that would require all modalities to be present at test time, which is not the case for cross-modal generation.

So instead they propose to factorise the joint variational posterior as a combination of unimodal posteriors, i.e. a weighted sum (mixture of experts):

$$
q_\Phi(z \mid x_{1:M}) = \sum_{m=1}^{M} \alpha_m \, q_{\phi_m}(z \mid x_m), \qquad \alpha_m = \frac{1}{M}
$$
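A minimal sketch of this mixture (hypothetical function names, assuming Gaussian unimodal posteriors and equal weights): draw one reparameterised sample per expert and score samples under the full mixture.

```python
import torch

def moe_log_q(z, mus, logvars):
    """log q_Phi(z | x_{1:M}) for an equally weighted mixture of M Gaussian experts.

    z: (D,) latent sample; mus, logvars: (M, D) tensors of expert parameters.
    """
    stds = torch.exp(0.5 * logvars)
    experts = torch.distributions.Normal(mus, stds)
    log_q_m = experts.log_prob(z).sum(-1)              # (M,) per-expert log density
    num_experts = torch.tensor(float(mus.shape[0]))
    return torch.logsumexp(log_q_m, dim=0) - torch.log(num_experts)  # log(1/M sum_m q_m)

def sample_moe(mus, logvars):
    """Stratified sampling: one reparameterised sample per expert, shape (M, D)."""
    stds = torch.exp(0.5 * logvars)
    return mus + stds * torch.randn_like(stds)
```

The paper combines this mixture with the IWAE estimator, using such stratified sampling across the unimodal experts.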

In contrast, when using a Product of Experts, any one modality holds the power of veto: the product is driven towards zero wherever a single expert assigns low density, whereas the mixture weighs each expert's contribution additively.
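A minimal numerical illustration of this veto effect (my own toy example, not from the paper): evaluating a candidate latent value under a product versus a mixture of two 1-D Gaussian experts, where one expert assigns it almost no density.

```python
from scipy.stats import norm

z = 2.0                                # candidate latent value
expert_a = norm(loc=2.0, scale=0.5)    # expert that likes z
expert_b = norm(loc=-2.0, scale=0.5)   # expert that assigns z almost no density

poe = expert_a.pdf(z) * expert_b.pdf(z)               # unnormalised product of experts
moe = 0.5 * expert_a.pdf(z) + 0.5 * expert_b.pdf(z)   # uniform mixture of experts

print(poe)  # ~1e-14: expert_b effectively vetoes z
print(moe)  # ~0.4:   the mixture still gives z substantial mass
```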

For a better understanding of VAEs from both the neural-network and the probabilistic perspective, you can follow these links:

Results

As this is primarily an idea- and theory-driven paper, the experiments are on the MNIST-SVHN dataset and the more challenging Caltech-UCSD Birds dataset. Results for all four criteria mentioned above are reported in the paper.

Shivansh Mundra
Pre-Final Year Undergraduate Student
