Neighbors Matter: How Homophily Shapes Graph Neural Networks

Neighbors Matter: How Homophily Shapes Graph Neural Networks

Given the wide range of configuration choices, setting up a Graph Neural Network can be daunting. Wouldn’t it be helpful to have a criterion that at least filters out less promising architectures?

Node and edge homophily ratios might offer just that guidance.

What you will learn: How to compute graph homophily and use it to guide the design of your Graph Neural Network


The complete article, featuring design principles, detailed implementation, in-depth analysis, and exercises, is available at Homophily to Shape Graph Neural Networks


🎯 Overview

Imagine being able to analyze a dataset upfront to determine the optimal configuration for a Graph Neural Network. Homophily provides a principled way to assess whether proximity in the graph structure—through nodes or edges—correlates with similarity in labels or classes.


⚠️ I strongly recommend to review my articles introducing PyTorch Geometric Taming PyTorch Geometric for Graph Neural Networks and Graph loaders Demystifying Graph Sampling & Walk Methods


What is Homophily?

In Graph Neural Networks (GNNs), homophily measures the tendency of nodes with the same label (or similar features) to be connected. High homophily graphs (e.g., citation networks) have neighbors likely sharing labels, while low homophily graphs (e.g., Amazon co-purchase graphs) might not.

The homophily of a graph characterizes how likely nodes with the same label are near each other in a graph.

Let’s start with the mathematical definition. Let G=(V,E) be a graph with a set of nodes (vertices) V, a set of edges E, label yv of node v and a neighbors of node v, N(v)., there are 3 types of homophily ratios:

Node homophily ratio: Average fraction of same-label neighbors per node.

Edge homophily ratio: The fraction of edges in a graph which connects nodes that have the same class label.

Class insensitive edge homophily ratio: Edge homophily is modified to be insensitive to the number of classes and size of each class. C number of classes, |Ck| number of nodes of class k and hk denotes the edge homophily ratio of class k.

In a nutshell, Edge homophily answers: "How likely is it that a random edge connects same-label nodes?" while Node homophily answers: "How homophilic is the neighborhood of a typical node?.


⚠️ Several factors can explain the discrepancy between node homophily and edge homophily ratios, including:

  • Edge homophily tends to be biased toward high-degree nodes, whereas node homophily gives equal weight to all nodes.

  • If a particular class consists primarily of high-degree nodes, it can artificially inflate the edge homophily due to class imbalance.

  • Node homophily typically excludes isolated nodes, which can further contribute to the difference.


Why Homophily?

Homophily is a key property of Graph Neural Networks that helps controlling:

  • Architecture decision

  • Depth of models

  • Aggregation method

  • Training strategy

Here is a rule of thumb:

Graph with high homophily 

  • Simple Graph Neural Network, Graph Convolutional Network or Graph Attention Networks works best

  • No more than two GNN layers are necessary

Graph with low homophily 

  • Heterophily-aware Graph Neural Networks

  • Features transformation prior to aggregation

  • Residual connections

Fig. 1 Illustration of impact of Homophily on the design of a Graph Convolutional Network

In this study, we treat the distribution of node neighbor sampling as an indicator of model complexity. In the example below, the random walk used to sample neighbors of node '0' may span 1, 2, 3, or more hops.

Fig. 2 Illustration of random walk/sampling of neighboring nodes from node ‘0’

We define model complexity arbitrarily as a combination of the number of hops and the maximum number of nodes randomly selected at each hop.

Fig. 3 Illustration of complexity of random sampling of neighboring nodes

This definition can be extended by incorporating additional factors, such as the number of Graph Convolutional layers, as illustrated earlier.

⚙️ Hands-on with Python

Environment

  • LibrariesPython 3.11.8, PyTorch 2.1.0, PyTorch Geometric 2.6.1, Optuna 4.2.0

  • Source code is available at Github.com/patnicolas/geometriclearning/dataset/graph

  • To enhance the readability of the algorithm implementations, we have omitted non-essential code elements like error checking, comments, exceptions, validation of class and method arguments, scoping qualifiers, and import statements.

Graph Convolutional Neural Networks


📌📌This section does not provide a theoretical review of Graph Neural Networks (GNNs). Instead, there are numerous high-quality tutorials and detailed explanations available in both printed resources and online materials. I recommend the following references 123, & 4 as tutorial..


Introduction

Graph Neural Network (GNN) is an optimizable transformation on all attributes of the graph (nodes, edges, global context) that preserves graph symmetries (permutation invariances). GNN takes a graph as input and generate/predict a graph as output.

Data on manifolds can often be represented as a graph, where the manifold's local structure is approximated by connections between nearby points. GNNs and their variants (like Graph Convolutional Networks (GCNs) extend neural networks to process data on non-Euclidean domains by leveraging the graph structure, which may approximate the underlying manifold.

Our Evaluation Model

For our study we use a Graph Convolutional Neural Network defined in a previous article, How to Tune a Graph Neural Networks - Hyperparameters Tuning Frameworkwith the following configuration:

  • 2 Graph Convolutional blocks with ReLU activation and no pooling

  • 1 fully connected output block with LogSoftmax activation.

Fig. 4 Illustration of a 2 graph convolutional layer model used in this study

We apply the neighbor node sampling method, described in Demystifying Graph Sampling & Walk Methods - Neighbor Node Sampling to embed and aggregate node attributes from neighboring nodes with the following configuration:

Datasets

PyTorch Geometric (PyG) offers a comprehensive collection of graph datasets across various domains [ref & 6]. These graph datasets are accessible through dedicated class in the torch_geometric.datasets module [ref 7]. 

In this study we analyze the following data sets

  • Cora: A standard benchmark dataset for semi-supervised node classification, containing 2,708 nodes (scientific publications) and 5,429 edges (citations). Each node is described by a 1,433-dimensional feature vector. This dataset is also included in torch_geometric.datasets.Planetoid class collection.

  • CiteSeer: A citation network containing 3,312 scientific publications classified into six categories. The network contains 4,715 edges, with each node represented by a very large 3,703-dimensional feature vector. This dataset is also included in torch_geometric.datasets.Planetoid class collection.

  • PubMed: Consists of 19,717 scientific publications from the PubMed database, each pertaining to diabetes and classified into one of three classes. The citation network includes 44,338 edges, and each node has a 500-dimensional feature vector. This dataset is also included in torch_geometric.datasets.Planetoid class collection.

  • Wikipedia: contains 11,701 nodes, 216,123 edges, 10 classes and 20 different training splits.

  • Flickr: Contains descriptions and common properties of 89,250 images along with899.756 edges and a 500-dimensional feature vector. It is defined in torch_geometric.datasets.Flickr class.

Node & Edge Homophily

We begin by defining an enumeration, GraphHomophilyType, to represent the four categories of homophily ratios described earlier.

The GraphHomophily class, shown in code snippet 1, encapsulates the computation of node homophily, edge homophily, and class-insensitive edge homophily. It provides two constructors:

  • init: The default constructor, which initializes the class with a given graph data and the selected type of homophily computation.

  • build: An alternative constructor that creates an instance using a predefined PyTorch Geometric dataset along with the specified homophily computation type.

The actual computation is performed in the call dunda method which invokes the PyTorch Geometric homophily function 

We include a homegrown implementation of the computation of the edge homophily (Code snippet 2) for reference.

We compute the various homophily ratios for the 5 PyTorch Geometric datasets: CoraPubMedCiteSeerWikipedia and Flickr.

Fig. 5 Bar Chart for Node, Edge and Class Insensitive Edge Homophily for 5 PyTorchGeometric Datasets

We use Cora as a representative high-homophily dataset and Flickr to illustrate a low-homophily scenario.

Training & Validation

The training and validation of our model are described in details in Plug & Play Training of Graph Neural Networks

As a reminder we used the following hyperparameters:

The validation records and display the 4 metrics (Accuracy, Precision, Recall, F1), training and validation loss for various Node Neighbor Sampling distributions:

  • Single hop - 4 neighbors [4]

  • Two hops - First: 4 neighbors, Second: 2 neighbors [10, 4]

  • Three hops - First: 12 neighbors, Second: 6 neighbors, Third: 3 neighbors [12, 6, 3]

Fig. 6 Plots of metrics, training & validation loss for the distribution of number of sampled neighboring nodes [4] for Cora dataset

Fig. 7 Plots metrics, training & validation loss for the distribution of number of sampled neighboring nodes [10, 4] for Cora dataset

Fig. 8 Plots of 4 key metrics, training & validation loss for the distribution of number of sampled neighboring nodes [12, 6, 3] for Flickr dataset

Analysis

We select precision as our primary performance metric that we plot against the average of Node and Edge Homophily ratios for each of the 3 node neighbor sampling configuration for Cora and Flickr datasets.

Fig. 9 Scatter plot of precision vs. average (node, edge homophily) for Cora and Flickr data sets with 3 configuration of node neighbors sampling

The results show that increasing the complexity of our model—specifically through Node Neighbor Sampling distribution—enhances precision on the low-homophily dataset (Flickr), while having minimal effect on the high-homophily dataset (Cora).


The complete evaluation is available in the full article Homophily to Shape Graph Neural Networks


📘 References

  1. A Practical Tutorial on Graph Neural Networks I. Ward, J. Joyner, C. Lickfold, Y. Guo, M. Bennamoun - 2012

  2. A Comprehensive Introduction to Graph Neural Networks - Datacamp - 2022

  3. Graph Neural Networks: A Gentil Introduction - YouTube. A. Persson

  4. Stanford CS: Machine Learning with Graphs - YouTube - CS-224 Stanford University Online

  5. Github - PyTorch Geometric

  6. Taming PyTorch Geometric for Graph Neural Networks

  7. PyTorch Geometric Datasets


Patrick Nicolas has over 25 years of experience in software and data engineering, architecture design and end-to-end deployment and support with extensive knowledge in machine learning. He has been director of data engineering at Aideo Technologies since 2017 and he is the author of "Scala for Machine Learning", Packt Publishing ISBN 978-1-78712-238-3 and Hands-on Geometric Deep Learning Newsletter.


#GeometricDeepLearning #GraphNeuralNetwork #Homophily

To view or add a comment, sign in

Others also viewed

Explore topics