### 1. Introduction

Auto-encoders are common techniques for learning low dimensional representations of continuous data such as images. Variational auto-encoders are an interesting, and in some cases powerful, approach to improving the quality of traditional auto-encoders. Many forms of graph representation learning can be viewed as a type of auto-encoder. This pithy paper by Kipf and Welling seeks to take that success and apply it to graph representation learning. Specifically, apply the convolutional neural net based variational auto-encoder approach to graph convolutional networks.

### 2. Background on Variational Auto-Encoders

Let’s start by getting an intuitive sense of what a variational auto-encoder is and why its useful. For an in-depth introduction I’d suggest this blog post. I am using their figures because they are useful. As can be seen in Figure 1, an auto-encoder in its simplest sense is two stacked neural networks – an encoder and a decoder. The objective of the model is for the encoder to learn a mapping of the data into a lower dimensional space in such a way that the decoder can learn to take this lower dimensional representation and recreate the original image.

A commonly observed limitation of autoencoders is that the lower dimensional space into which the encoder learns to map the data is not continuous across class boundaries. In domain areas where a generative model may be desired, say you want to know what an image that is a mix of a 3 and 7 looks like, you may not be able to recover this from a traditional autoencoder. In a sense a variational auto-encoder seeks to impose a shape on the learnable low dimensional space.

The simplest way to do this, is to have final layer of the encoder parameterize distributions over those values. The decoder now has to learn to decode values from the samples of this distribution. Now for any input to the encoder you can recover a distribution of where that data point resides in the latent space. This adds smoothness to the latent space but does not on its own create the kind of continuity we discussed above. The parameters of these distributions can still move unconstrained. In order to provide this constraint, an extra term is added to the loss function of a traditional auto-encoder. Namely, the variational auto-encoder seeks to minimize reconstruction error, while at the same time minimizing the Kullback-Liebler (KL) Divergence between the proposed distribution and the standard normal distribution. In simple terms, this forces the low-dimensional space to roughly take the shape of the standard normal (or whatever the variational distribution you choose).

### 3. Background on Graph Convolutional Networks

Graphs are a common structure of data in the world. Anything from social networks, financial transactions to transportation systems can be represented as a graph. Because they tend to be extremely high dimensional, sparse and not particularly euclidean, it took some time for researchers to begin applying neural network models to these types of data structures. However, that time has come and there have been a bevy of great works in the last few years. To name a few, Bruna et al., ICLR 2014, Henaff et al., 2015 Duvenaud et al., NIPS 2015, Li et al., ICLR 2016, Defferrard et al., NIPS 2016, and Kipf & Welling, ICLR 2017. I hope to cover some of these in future posts.

The basics of a graph convolutional network are as follows. The models typically require two matrices as input. An N X D feature matrix, comprised of all the features D for every node in the graph N. The second matrix is often an adjacency matrix, or some close approximation thereof. The use of this second matrix is to capture the topology of the network. The first layer of the network takes in the adjacency matrix and the feature matrix and generates another feature matrix. This process can be layered multiple times with the final layer producing a feature matrix corresponding to a node’s label. There are a lot of nuances to how to build these layers and this excellent blog goes into some of them.

### 4. Variational Graph Auto-Encoders

So now we want to put these two ideas together to create a variational graph auto-encoder. First let’s start with a vanilla graph auto-encoder. Much like a traditional image auto-encoder you stack two convolutional networks, a encoder and a decoder. However, in this case each of the two networks are graph convolutional network. To put it most simply, you use a graph convotional network to learn a node embedding for every node in your network. To do this the model takes in an N X N adjacency matrix of the original graph and learns an N X Z matrix such that the pairwise inner products of every Zi, Zj gets as close to the (0, 1) values in the input adjacency matrix for that Ai, Aj. This can be seen in the two equations below

Finally, to learn the variational graph auto-encoder you have to add two components. Firstly, we need to learn the parameters of the distribution over the latent states as can be seen below. The encoder GCN parameterizes a multivariate Gaussian for every node in the graph. They constrain the covariate matrix of this Gaussian to be diagonal for computational simplicity.

The second component is adding the Kullback-Leibler divergence between each of the latent state distributions and the standard normal distribution to the loss function. This forces the learned embeddings to conform to the shape of the variational distribution. In this case it forces the embeddings to reside in a roughly spherical space as can be seen in Figure 4.

Mind the signs here, technically you want to maximize the log-probability of the pairwise inner products, while also minimizing the KL divergence. Therefore in a standard training format you would flip the signs for both of these values.

### 5. Closing Thoughts

Other than showing that VGAE’s are good for link prediction, this paper doesn’t really explain why you would want to use a VGAE. Learning a latent space that conforms to the standard normal distribution shape is useful in the image representation task. However, it is not clear to me what additional benefits are granted by doing this in the graph context. Perhaps the interpolation task is important for some use cases. Given two nodes A and C, we want to interpolate an intermediary node and generate what we would expect its neighbors and features to look like.