Convolutional Variational Autoencoder as Image Similarity Metric

This is a demonstration of a convolutional variational autoencoder model trained on image data representing the state of non-linear differential equations that model material morphology evolution during the fabrication process of organic thin films.  The ultimate goal of this effort is to find a way to compare two frames from our distribution in terms of high-level semantics.  This work should be considered a proof of concept.


The dataset consists of 57 trajectories, each with an average of about 255 frames capturing its state over time. The frames are grayscale images (with pixel intensities ranging from 0 to 1), each 400 pixels wide and 100 pixels high. In total, there are 14,541 images in the dataset.

The data is split randomly into training and testing sets at an 80/20 ratio (respectively). The data is augmented by reflecting each image over the y-axis and by rolling it along the x-axis. The images are also binarized: pixels with intensity below 0.5 are set to 0, and all others are set to 1.
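The preprocessing described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the code used for this demonstration: the function names, the frame-level split, the 50% flip probability, and the uniformly random roll offset are all hypothetical choices, and the stand-in array replaces the real frames.

```python
import numpy as np

def binarize(frames, threshold=0.5):
    """Set pixels below the threshold to 0 and all others to 1."""
    return (frames >= threshold).astype(np.float32)

def augment(frame, rng):
    """Randomly mirror a frame left-to-right and roll it along the x-axis."""
    if rng.random() < 0.5:  # assumption: flip half of the time
        frame = frame[:, ::-1]  # reflect over the y-axis (reverse the columns)
    shift = int(rng.integers(0, frame.shape[1]))  # assumption: uniform offset
    return np.roll(frame, shift, axis=1)  # wrap-around shift along the width

def train_test_split(frames, test_fraction=0.2, seed=0):
    """Shuffle individual frames and split them 80/20 (assumed frame-level split)."""
    order = np.random.default_rng(seed).permutation(len(frames))
    n_test = int(round(test_fraction * len(frames)))
    return frames[order[n_test:]], frames[order[:n_test]]

# Stand-in data: 50 random frames, 100 pixels high and 400 pixels wide.
rng = np.random.default_rng(0)
frames = rng.random((50, 100, 400), dtype=np.float32)

train, test = train_test_split(binarize(frames))
augmented = augment(train[0], rng)
```

Both augmentations only rearrange pixels, so an augmented frame contains exactly the same number of "on" pixels as the frame it came from.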

Animated trajectory sample for BR0.54-CHI2.4

Convolutional Variational Autoencoder

The model is a convolutional variational autoencoder (CVAE).  An autoencoder is a neural network architecture which works to map data to a compact latent space in an unsupervised manner.  A variational autoencoder (VAE) provides a probabilistic manner for describing the latent space.  It being convolutional means that it can take images as input and preserve the spatial structure of the data.  Below is a basic schematic for a CVAE:

Convolutional Variational Autoencoder Architecture



Our model consists of an encoder with 2 convolutional layers and 2 dense layers, 16 latent variables (16 means and 16 standard deviations), and a decoder with 2 dense layers and 2 convolutional layers.  The architecture and hyperparameters were only coarsely tuned, and the model likely lacks the capacity to learn the data optimally.  Still, the model is suitable for feasible experimentation.
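To make the "16 means and 16 standard deviations" concrete, here is a minimal sketch of how a latent code is typically sampled in a VAE (the reparameterization trick). It assumes the encoder emits a mean and a log-variance per latent variable, which is a common convention but not something confirmed for this model; `sample_latent` is a hypothetical name.

```python
import numpy as np

def sample_latent(mean, log_var, rng):
    """Draw z = mean + std * eps with eps ~ N(0, I)."""
    eps = rng.standard_normal(mean.shape)
    std = np.exp(0.5 * log_var)  # log-variance -> standard deviation
    return mean + std * eps

rng = np.random.default_rng(0)

# Stand-in encoder outputs for a batch of 4 images and 16 latent variables.
mean = np.zeros((4, 16))
log_var = np.full((4, 16), -2.0)

z = sample_latent(mean, log_var, rng)  # one 16-dimensional code per image
```

Keeping the randomness in `eps` means the sample is a differentiable function of the encoder outputs, which is what lets gradients flow back through the sampling step during training.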


Stochastic gradient descent with the Adam optimizer was used to train the model by maximizing the evidence lower bound (ELBO).  After about 4000 epochs of training, the learning rate of the Adam optimizer was decreased from 1e-3 to 1e-4 to overcome stalling in training.  Again, better tuning of hyperparameters could improve the training process.
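As a rough illustration of the objective, the sketch below computes a per-image negative ELBO in NumPy. It assumes a Bernoulli likelihood over the binarized pixels (binary cross-entropy on decoder logits) and a standard normal prior on the latent variables; neither detail is stated for this model. The learning-rate step is written as a simple function of the epoch, and both function names are hypothetical.

```python
import numpy as np

def negative_elbo(x, x_logits, mean, log_var):
    """Per-image negative ELBO: reconstruction loss plus KL divergence to N(0, I)."""
    # Numerically stable binary cross-entropy from logits, one value per pixel.
    bce = (np.maximum(x_logits, 0) - x_logits * x
           + np.log1p(np.exp(-np.abs(x_logits))))
    reconstruction = bce.reshape(len(x), -1).sum(axis=1)
    # Closed-form KL between N(mean, diag(exp(log_var))) and N(0, I).
    kl = 0.5 * np.sum(np.exp(log_var) + mean**2 - 1.0 - log_var, axis=1)
    return reconstruction + kl

def learning_rate_for_epoch(epoch, switch_epoch=4000):
    """Step the learning rate from 1e-3 down to 1e-4 at the switch epoch."""
    return 1e-3 if epoch < switch_epoch else 1e-4

# Stand-in batch: 2 binarized frames and an "untrained" model that outputs zeros.
rng = np.random.default_rng(0)
x = (rng.random((2, 100, 400)) >= 0.5).astype(np.float64)
loss = negative_elbo(x, np.zeros_like(x), np.zeros((2, 16)), np.zeros((2, 16)))
```

Under these assumptions, a decoder that outputs a probability of 0.5 for every pixel scores 40,000 × ln 2 ≈ 27,726 per image, which is the same order of magnitude as the starting value reported for training.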

Training ran for 10,000 epochs in total, taking about 3 hours on a GTX 1070 Ti, and brought the reported ELBO value from ~25000 at the start down to ~7500.  (Since this value falls as the model improves, it is presumably the negative ELBO, i.e. the loss being minimized.)

Plot of ELBO over training. Note the bump at around 4000 epochs when the learning rate was reduced from 1e-3 to 1e-4.


With a trained model, the following image from the dataset:

Original image from the dataset.

is passed through the encoder and the decoder to produce the following image:

Decoded image.

and with binarization:

Binarized decoded image.

The model, then, is able to reproduce images with reasonable accuracy, which indicates that the 16-variable latent encoding retains much of the information needed to reconstruct a frame.

t-SNE Visualization

Our goal is to use this model as a function capable of measuring the similarity between two images sampled from this distribution.  The model we trained is able to map these high-dimensional images into a small latent space, and the hope is that it will have been forced to learn an efficient mapping based on high-level features (rather than pixel-level features).

In order to illustrate this, t-distributed stochastic neighbor embedding (t-SNE) is used to embed our 16 latent variables into 2 dimensions, which will hopefully show patterns indicating that our model has learned the kind of mapping we are looking for.  Below is an interactive plot of the t-SNE results for a subset of the dataset (1/25th of the frames of each trajectory).  Each color is its own trajectory, and clicking on a point will render the corresponding image.

Notice the trajectories forming distinct paths pointing radially outward from the center.  It seems, after some quick looking around, that the center of the main cluster contains more of the chaotic frames, and the further away you get radially the simpler they get.  It would also seem that, generally, frames which look similar (at a high level) are close to one another.  Also notice the second, smaller, denser cluster on the outskirts of the plot.  These are all empty frames, and their position is likely arbitrary and an artifact of t-SNE.

I want to note that I am not very familiar with the inner workings of t-SNE, and its results are notoriously tricky to interpret.  However, I feel this is a case of the proof being in the pudding, as the above plot demonstrates some clear (and reproducible) patterns consistent with our hypothesis.  Still, I caution anyone against reading too deeply into this particular visualization.

Nearest Neighbor Finding Tool

Our model seems capable of mapping these frames into a somewhat spatially-meaningful representation, so we should be able to examine the Euclidean distance between the latent encodings of frames to determine how similar two images are.

The tool below demonstrates this by finding the nearest neighbor (from our 1/25th dataset) to a user-provided image (the tool isn't very robust; I suggest finding a simple image from the t-SNE visualization and trying to replicate it).

You should find that the tool can sometimes present frames from our limited dataset which are similar to the user-provided image.  This indicates that the distance between the encodings of similar-looking frames is relatively small, which is what we're looking for.
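The lookup behind such a tool can be sketched as a brute-force Euclidean nearest-neighbor search over latent codes. This is a minimal illustration, not the tool's actual implementation: it assumes each frame is represented by its 16 encoder means, `nearest_neighbors` is a hypothetical name, and the random array stands in for real encodings.

```python
import numpy as np

def nearest_neighbors(query_code, reference_codes, k=1):
    """Return indices and Euclidean distances of the k closest latent codes."""
    distances = np.linalg.norm(reference_codes - query_code, axis=1)
    order = np.argsort(distances)[:k]  # closest first
    return order, distances[order]

# Stand-in latent codes: 500 reference frames with 16 latent variables each.
rng = np.random.default_rng(0)
reference_codes = rng.standard_normal((500, 16))

# A query that is a slightly perturbed copy of reference frame 42.
query_code = reference_codes[42] + 0.01 * rng.standard_normal(16)

indices, distances = nearest_neighbors(query_code, reference_codes, k=3)
```

With only a few hundred 16-dimensional codes, a brute-force search like this is cheap; a spatial index would only start to matter for a much larger reference set.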

Further Work

Given the very limited dataset and the very low-capacity model used in this demonstration, I think the results are promising in terms of possibly utilizing an approach such as this in our efforts.  I would consider this instance of the model to be orders of magnitude more limited than what we can potentially build, and with just a larger dataset we might be able to start using it directly.

Proper infrastructure is needed to support training and tuning, including better pipelines, better (and possibly distributed) hardware, and proper development tools.  This demonstration is running out of a Docker container, which should (in theory) be easy to configure to scale and host remotely, and it could possibly be incorporated directly into uses such as deep reinforcement learning.

Some more planning can also go into the design of the model (and its hyperparameters), the flow and coordination of training, and how it can be used.  AutoML may be used to do a lot of the meta-training, and some more research into other approaches, such as GANs, may help us find a better solution.
