### 1. Introduction

This is a neat little paper that the inimitable Barabasi published at Neurips 2020 contains several great ideas for how we can push our theoretical understanding of graph representation learning. As an added bonus, a colleague of mine pointed out the fun Northeastern - Northwestern collaboration on the authorship. This paper builds on a body of literature in recent years which have pointed out that despite empirical success of GNNs, theoretical understand of what and how they learn lags. The way this paper seeks to extend theoretical understanding is by looking at how different GNN configurations perform at capturing statistical properties of the graph (graph moments). The idea is that an algorithm embedding a graph is more expressive if you can recover the graphs moments from the produced embeddings. Perhaps more interesting is when you view this as a diagnostic tool more than a evaluation metric. That is because how an algorithm performs in terms of recovering a single moment vs another might tell you what information that algorithm has used and what information it has discarded.

### 2. Background Material

Let’s take a bit to ground this paper as for readers who aren’t familiar with papers on random graphs there might be some things in here that are strange to you. The idea of a random graph is that for any graph there is some underlying data generating process that determines the placement of edges. In the case of natural graphs (e.g. graphs we have through collecting relationship data like social networks) the data generating process is unobserved. However, for random graphs we can define a data generating process. This paper uses two of the most common ones to generate the graphs that they test on. The Barabasi-Albert algorithm and the Erdos-Renyi algorithm. In both cases the model starts with some nodes and as new nodes are added edges are added to existing nodes with some probability. The Barabasi-Albert graph is a preferrential attachment model, meaning that a node will more likely get a new edge if it already has a high proportion of the total edges. This creates a hub style network with a powerlaw nature to the degree distribution. On the other hand in an Erdos-Renyi model the edge gets added with a fixed probability which is the same for every node. This paper also refers to a modification of the Barabasi-Albert graph known as the configuration model. The configuration model seeks to maintain the same degree distribution per node but completely re-wire the graph to create a fake graph that looks like the original but hasn’t been created by the same model.

The concept of graph moments arise from the random graph paradigm. Much like how you might compute the summary statistics of tabular dataset, each moment represents some summarization of the graph’s properties. So in the cast of the first moment, it is also the degree distribution. The k-order moments referred to in this paper are the total number of paths of length K along which you can get to node i averaged over all nodes.

Finally, the paper refers to three different ways of operating on a graph. The adjacency matrix and two other variants of the adjacency matrix can be seen below:

### 3. Theory

In terms of advancing the theoretical understanding of Graph Neural Networks two things jumped out to me. Firstly, they start by analyzing a fully connected model operating directly on the adjacency matrix and compare them to graph convolutions operating on the normalized and symmetrized adjacency matrices respectively. The graph convolutions prove to be more efficient than a fully connected network. However, they have very specific limitations. Primarily, they quantify the relationship between the depth of a gcn model and the order of moment it is able to learn. Namely, a GCN can’t learn a graph moment with fewer layers than the order of the moment. On its own it can learn an equivelent moment for each layer for n = p, or n > p if bias is included.

The other thing they point out is that each of the formulations of the adjacency capture different information and that combining all three of them as an input to a model along with a residual connection for each layer adds significant expressiveness to the model. Their proposed model can be seen below.

### 4. Results Highlights and Discussion

Using this model they compare the ability to capture the first three moments of a random graph. They test the impact of depth and activation function. Primarily, they show that for higher moments shallow models do not perform well with optimal performance typically achieved with three to four layers. In terms of activation function it appears that linear activations actually perform the best. However, the test loss jumps around quite a lot in later epochs. The sigmoid activation function has a slightly lower performance from the linear activation but stability is greatly improved.

A secondary evaluation was conducted to determine the model’s ability to distinguish between Erdos-Renyi graphs and Barabasi Albert graphs as well as distinguish between Barabasi-Albert graphs and the configuration model. Here too the deeper the model the better it performed in distinguishing between these two random models.

In summary, while they introduce a new model and a new way of evaluating the representation power of a graph representation learning algorithm, this paper still felt a little light in terms of the theoretical analysis. Just pointing out that standard GCNs fail to capture higher order graph moments, doesn’t explain why. Creating a more complicated model architecture that can capture higher order graph moments only serves to add another model which needs to be studied theoretically.

### Citations & Further Reading

- Dehmamy, Nima, Albert-László Barabási, and Rose Yu. “Understanding the Representation Power of Graph Neural Networks in Learning Graph Topology.” Advances in Neural Information Processing Systems. 2019.