1 KYOTO UNIVERSITY
KYOTO UNIVERSITY
GRADUATE SCHOOL OF INFORMATICS
Graph Machine Learning
- Past, Present, and Future -
Hisashi Kashima
Kyoto University
2 KYOTO UNIVERSITY
◼ Graph machine learning and graph signal processing have much in
common, but they have developed relatively separately
◼ A history from the standpoint of graph machine learning, particularly
predictive modeling, together with some of the recent developments
◼ Past: The age of data mining and kernel machines
◼ Current: The age of graph neural networks
◼ Future?: Fusion with causal inference
Graph machine learning
Past, present, and future
(Figure: Graph Machine Learning and Graph Signal Processing; today’s topic is graph machine learning, covering local and global features)
3 KYOTO UNIVERSITY
Graphs are versatile tools that model relationships between entities using nodes (points) and edges (lines connecting them). In the real world, graphs represent various complex systems and interactions:
◼ Social networks: Each person is a node, and friendships or professional connections are edges. Graphs help identify influential people, community structures, and the spread of information, aiding marketing strategies and social behavior research.
◼ Transportation networks: Cities or intersections are nodes, and roads or railways are edges. Graphs optimize routes, manage traffic, and improve urban planning; navigation apps use graph algorithms to find the quickest routes.
◼ Biological networks: Proteins or genes are nodes, and their interactions are edges. These graphs help us understand cellular functions and disease mechanisms, guiding the development of targeted therapies.
◼ Communication networks: Devices like computers or servers are nodes, and communication links are edges. Graphs support efficient data transfer, robust network design, and better cybersecurity.
◼ Recommendation systems: Users and products are nodes, and interactions (purchases or ratings) are edges. Graphs enhance recommendation accuracy, improving user experience on platforms like Amazon and Netflix.
Graphs are crucial for visualizing and analyzing complex relationships, optimizing processes, and improving decision-making across fields from social media and urban development to biology and technology.
Graph machine learning
Graphs are everywhere!
# Do not read them seriously, texts and figures are generated by ChatGPT
4 KYOTO UNIVERSITY
(The same ChatGPT-generated “graphs are everywhere” text as the previous slide, repeated verbatim.)
Graph machine learning
Graphs are everywhere!
We Just Skip This Because
WE ALL L♥VE GRAPHS !!
# Do not read them seriously, texts and figures are generated by ChatGPT
5 KYOTO UNIVERSITY
The Age of Data Mining
6 KYOTO UNIVERSITY
◼ “Data mining” emerged in the 1990s
◼ originated from the database community
◼ with the aim of discovering knowledge from large databases
◼ Association rules: One of the major inventions of data mining
◼ Rules with the form “If 𝐴, then 𝐵” that satisfy
◼ Pr(𝐴 ∧ 𝐵) > 𝜃 : support constraint
◼ Pr(𝐵 | 𝐴) > 𝜂 : confidence constraint
◼ Example: “If buy(burger) ∧ buy(fries), then buy(soda)”
◼ Key technical challenge: How to enumerate all rules efficiently
Data Mining
Aiming for knowledge discovery from huge databases
(Figure: if buy(burger) ∧ buy(fries), then buy(soda), with high probability)
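To make the support and confidence constraints concrete, here is a minimal Python sketch; the toy transaction database and the threshold values are illustrative assumptions, following the slide’s burger/fries/soda example:

```python
# A minimal sketch (toy data, not from the slides): computing support and
# confidence of a candidate rule over a small transaction database.
transactions = [
    {"burger", "fries", "soda"},
    {"burger", "fries"},
    {"burger", "fries", "soda"},
    {"fries", "soda"},
]

def support(itemset):
    """Pr(itemset): fraction of transactions containing all its items."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Pr(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

A = {"burger", "fries"}
B = {"soda"}
print(support(A | B))    # support of "if A then B" -> 0.5
print(confidence(A, B))  # confidence -> 2/3
```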
7 KYOTO UNIVERSITY
◼ Itemset pattern mining problem:
Enumerate all itemsets (combinations of items) that frequently appear
in database
◼ Find all itemsets appearing at least 𝑘 times
◼ Challenge: Exponential number of candidate itemset patterns exist
◼ We need to explore the huge space of the combinations efficiently
Itemset pattern mining
Discover all frequent item combinations
8 KYOTO UNIVERSITY
◼ Search space composition: Make a non-redundant search space
◼ To avoid evaluating the same patterns over and over again
◼ Search space pruning: Exploiting monotonicity
◼ If itemset {A} appears fewer than 𝑘 times, any superset {A, B} will never appear 𝑘 times
◼ Explore smaller item sets first, larger item sets later
Techniques for itemset pattern mining
Composition and pruning of search space
(Figure: if itemset {A} appears fewer than 𝑘 times, the whole branch of its supersets can be pruned)
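A minimal Apriori-style sketch of these two techniques (the toy database and threshold are illustrative; this is not the original algorithm’s code): smaller itemsets are explored first, and any candidate with an infrequent subset is pruned using the monotonicity of support:

```python
from itertools import combinations

def frequent_itemsets(transactions, k):
    items = {i for t in transactions for i in t}
    level = [frozenset([i]) for i in items]
    frequent = []
    while level:
        # keep only itemsets appearing at least k times in the database
        counted = [s for s in level if sum(s <= t for t in transactions) >= k]
        frequent.extend(counted)
        keep = set(counted)
        # join frequent size-n sets into size-(n+1) candidates; any candidate
        # with an infrequent size-n subset is never generated (pruning)
        level = list({a | b for a, b in combinations(counted, 2)
                      if len(a | b) == len(a) + 1
                      and all(frozenset(c) in keep
                              for c in combinations(a | b, len(a)))})
    return frequent

db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(frequent_itemsets(db, k=3))  # singletons and all pairs; {A,B,C} fails
```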
9 KYOTO UNIVERSITY
◼ Attempts to extend pattern mining to graphs began around 2000
◼ AGM algorithm: Seminal work by Inokuchi et al. (2000)
◼ What is “knowledge” in graph data mining? - Subgraphs
◼ Substructures determine the properties of structured data
◼ Goal: Find all subgraph patterns appearing at least 𝑘 times
Graph Mining
Extension of frequent itemsets to graph data
(Quoted from Takigawa, I., & Mamitsuka, H. (2013). Graph mining: procedure, application to drug discovery and recent advances. Drug Discovery Today, 18(1-2), 50-57.)
Inokuchi, A., Washio, T., & Motoda, H. (2000). An apriori-based algorithm for mining frequent substructures from graph data.
In Principles of Data Mining and Knowledge Discovery (PKDD), 2000
10 KYOTO UNIVERSITY
◼ It is non-trivial to define an efficient search space for graphs
◼ Smart graph coding to avoid duplicate checking of isomorphic graphs
◼ AGM [Inokuchi et al., 2000] employed a vertex sorting code
◼ gSpan [Yan & Han, 2002] employed a depth-first search code
(and depth-first search using them)
Technical challenges in graph mining
Design of search space
(Quoted from Yan, X., & Han, J. (2002) gSpan: Graph-based substructure pattern mining. In IEEE International Conference on Data Mining (ICDM))
11 KYOTO UNIVERSITY
◼ Not explicitly oriented to prediction tasks (in machine learning)
◼ One exception is an interesting idea by Kudo et al. (2004):
Graph patterns are used as weak learners in the boosting algorithm
◼ Other limitations:
◼ Only discrete labels are assumed
◼ Node and edge labels are restricted to discrete values
due to the construction of the discrete search space
◼ Continuous labels are usually discretized in advance
◼ High demands on computation and memory
Limitations of graph mining
Not explicitly oriented to prediction tasks
Kudo, T., Maeda, E., & Matsumoto, Y. (2004). An application of boosting to graph classification.
Advances in Neural Information Processing Systems, 17.
12 KYOTO UNIVERSITY
The Age of Kernel Machines
13 KYOTO UNIVERSITY
◼ Increased interest in graphs in machine learning,
where prediction is a major target:
◼ Graph classification, node classification, link prediction, …
◼ In data mining, knowledge discovery is the goal
◼ Using subgraphs as features in a predictive model seems like a natural
idea…, but the exponential number of features is still a barrier
Graph machine learning
Tasks that aim directly at prediction (and others)
(Figure: given molecular graphs labeled safe/poisonous, build a predictive model on subgraph features to predict “poisonous or safe?” for a new molecule)
14 KYOTO UNIVERSITY
◼ Support vector machine proposed by Cortes & Vapnik in 1995
◼ Linear model for data mapped to an ultra-high-dimensional feature
space: 𝑓(𝐱) = 𝐰⊤𝛟(𝐱)
◼ Equivalent to a non-linear model in the original space
◼ Can also be represented as a linear combination of kernel functions:
𝑓(𝐱) = Σ𝑖=1…𝑁 𝛼(𝑖) ⟨𝛟(𝐱), 𝛟(𝐱(𝑖))⟩
◼ “Kernel trick”: No matter how high (or even infinite) the dimensionality of
the feature space is, predictions can be computed as long as the inner
product (= kernel function) can be evaluated efficiently
◼ An attractive framework with a high degree of freedom that is
applicable to any type of data as long as the kernel function is available
Kernel methods
Realization of nonlinear prediction via “kernel trick”
(the inner product ⟨𝛟(𝐱), 𝛟(𝐱(𝑖))⟩ is the kernel function 𝑘(𝐱, 𝐱(𝑖)))
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20, 273-297.
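A minimal sketch of the dual-form prediction; the RBF kernel and the toy data/coefficients are illustrative assumptions, not something specified on the slide:

```python
import numpy as np

def rbf_kernel(x, y, gamma=1.0):
    # an inner product in an infinite-dimensional feature space,
    # computed without ever materializing that space
    return np.exp(-gamma * np.sum((x - y) ** 2))

def predict(x, X_train, alpha):
    # f(x) = sum_i alpha_i * k(x, x_i): only kernel evaluations are needed
    return sum(a * rbf_kernel(x, xi) for a, xi in zip(alpha, X_train))

X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
alpha = [1.0, -1.0]
print(predict(np.array([0.1, 0.0]), X_train, alpha))
```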
15 KYOTO UNIVERSITY
◼ Many attempts to design kernel functions for various data types
◼ Convolution kernels for structured data [Haussler, 1999]
◼ Break down target structured data into parts (substructures)
◼ Define kernel functions by accumulating similarities between parts
◼ Graph kernels: 𝑘(𝐺, 𝐺′) = ⟨𝛟(𝐺), 𝛟(𝐺′)⟩
◼ The idea of convolution kernels is also applicable to graphs
◼ How to define 𝛟? - Natural idea is to use subgraphs as the parts
◼ Trade-off between expressiveness of a class of subgraphs
and computational efficiency needs to be considered
Graph kernels
Kernel methods for graph-structured data
Haussler, D. (1999). Convolution kernels on discrete structures.
Technical report, Department of Computer Science, University of California at Santa Cruz.
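As one illustration of the convolution-kernel recipe, the sketch below uses labeled edges as the “parts”; this choice is a deliberate simplification for illustration, not a specific published graph kernel:

```python
from collections import Counter

# Decompose each labeled graph into parts (here: labeled edges) and take the
# inner product of the part-count histograms, i.e., <phi(G), phi(G')>.
def parts(graph):
    labels, edges = graph  # node labels as a dict, edges as (u, v) pairs
    return Counter(tuple(sorted((labels[u], labels[v]))) for u, v in edges)

def convolution_kernel(g1, g2):
    p1, p2 = parts(g1), parts(g2)
    return sum(p1[p] * p2[p] for p in p1)

g1 = ({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])  # C-C-O
g2 = ({0: "C", 1: "O"}, [(0, 1)])                  # C-O
print(convolution_kernel(g1, g2))  # one shared C-O part -> 1
```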
16 KYOTO UNIVERSITY
◼ Requirement: Trade-off between expressiveness of a class of
subgraphs and computational efficiency
◼ Random walk kernel [Kashima et al. (2003), Gärtner et al. (2003)]
◼ Use label sequences generated by random walks on graphs
◼ Infinite number of label sequences exist
◼ Can be computed in polynomial time by solving linear equations
◼ In practice, a few matrix multiplications with the power method
suffice
◼ Numerous extensions: tree-like patterns, small subgraphs, …
Random walk graph kernel
Infinite number of graph features in polynomial time
(Example label sequences generated by random walks: (5), (4, 3, 4), (4, 5, 2, 3), (1, 4), …)
Kashima, H., Tsuda, K., & Inokuchi, A. (2003). Marginalized kernels between labeled graphs. In ICML.
Gärtner, T., Flach, P., & Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In COLT.
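A sketch of the random walk kernel for unlabeled graphs (label handling is omitted for brevity, and the decay weight is an illustrative choice); the infinite sum over walk lengths reduces to one linear system on the direct product graph:

```python
import numpy as np

def random_walk_kernel(A1, A2, lam=0.1):
    # adjacency of the direct product graph: walks on it correspond to
    # simultaneous walks on both input graphs
    Ax = np.kron(A1, A2)
    n = Ax.shape[0]
    # sum over walks of all lengths: sum_k lam^k Ax^k = (I - lam*Ax)^{-1},
    # obtained by solving a linear system rather than inverting a matrix
    v = np.linalg.solve(np.eye(n) - lam * Ax, np.ones(n))
    return np.ones(n) @ v

A1 = np.array([[0, 1], [1, 0]], dtype=float)                   # a single edge
A2 = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # a triangle
print(random_walk_kernel(A1, A2))
```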
17 KYOTO UNIVERSITY
◼ Kernel trick elegantly solved the curse of dimensionality
◼ Feature dimensionality does not matter,
as long as the kernel function can be computed efficiently
◼ However, the kernel trick also created a new “curse of big data”
◼ Size of the problem/model depends on training data size 𝑁:
𝑓(𝐱) = Σ𝑖=1…𝑁 𝛼(𝑖) 𝑘(𝐱, 𝐱(𝑖))
◼ Serious bottleneck when dealing with large data
◼ Some remedies were proposed, including compression of kernel
matrices …
The “curse” of kernel trick
Vulnerable to data size increase
18 KYOTO UNIVERSITY
◼ Now is the time to move from dual space back to primal space!
◼ Explicit feature composition in primal space by discarding kernel trick
◼ Weisfeiler-Lehman (WL) kernel [Shervashidze et al., 2011]
◼ Based on WL graph isomorphism test
◼ Each node obtains explicit feature representation of local structure
by message passing from neighborhood nodes
◼ (BTW, Hido & Kashima (2009) proposed essentially the same idea… )
Weisfeiler-Lehman (WL) kernel
Feature construction in primal space by message passing
Update
Shervashidze, N., Schweitzer, P., Van Leeuwen, E. J., Mehlhorn, K., & Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels.
Journal of Machine Learning Research, 12(9).
Hido, S., & Kashima, H. (2009). A linear-time graph kernel. In IEEE International Conference on Data Mining (ICDM).
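A minimal sketch of WL feature construction; the label-compression table is simplified to a per-call lookup here, whereas in practice it is shared across all graphs being compared:

```python
from collections import Counter

def wl_iteration(labels, adj):
    # signature = (own label, sorted multiset of neighbor labels)
    sig = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
           for v in adj}
    # compress each distinct signature into a new integer label
    table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
    return {v: table[sig[v]] for v in adj}

def wl_features(labels, adj, h=3):
    # histogram over (iteration, label) pairs = explicit feature vector;
    # the WL kernel is the inner product of two such histograms
    hist = Counter((0, l) for l in labels.values())
    for it in range(1, h + 1):
        labels = wl_iteration(labels, adj)
        hist.update((it, l) for l in labels.values())
    return hist

path = {0: [1], 1: [0, 2], 2: [1]}  # a 3-node path graph
print(wl_features({0: 0, 1: 0, 2: 0}, path))
```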
19 KYOTO UNIVERSITY
The Age of Deep Neural Networks
20 KYOTO UNIVERSITY
◼ The 2010s saw the rise of deep learning; the trend extended to graph
machine learning ⇒ Graph neural networks (GNNs)
◼ Two (eventually similar) streams of graph neural network design
◼ Graph convolutional neural network
◼ Originates from graph signal processing
◼ Message passing graph neural network
◼ Based on the idea of aggregation of graph substructures
𝐱𝑖^NEW = aggr(𝐱𝑖, Σ𝑗∈𝑁𝑖 𝐱𝑗)
Graph neural network (GNN)
Graph convolution and message passing
Update node representation by
aggregating information of adjacent vertices
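A minimal numpy sketch of one message-passing layer; the ReLU and the two linear maps are illustrative choices of the aggr function, not a specific published architecture:

```python
import numpy as np

def message_passing_layer(X, A, W_self, W_nbr):
    # X: (n, d) node representations, A: (n, n) adjacency matrix
    neighbor_sum = A @ X  # sum of x_j over j in N_i
    # aggr(x_i, sum_j x_j) realized as a learned linear map + nonlinearity
    return np.maximum(0.0, X @ W_self + neighbor_sum @ W_nbr)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)  # 3-node path
X = rng.random((3, 4))
X_new = message_passing_layer(X, A, rng.random((4, 8)), rng.random((4, 8)))
print(X_new.shape)  # (3, 8)
```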
21 KYOTO UNIVERSITY
Structure of graph neural networks
Capture large substructures with multiple layers
Forward path
from input graph
to output label
(Quote and edit from Pope et al. CVPR (2019))
Each layer recognizes larger
subgraph features
Prediction based on
entire graph representation
22 KYOTO UNIVERSITY
◼ Various extensions of graph neural networks
◼ Graph attention: introducing attention mechanism into GNNs
◼ Focus on important vertices in information aggregation
◼ Extensions of target graph class
◼ Heterogeneous graph
◼ Hypergraphs
◼ Graph of Graphs (GoG)
◼ E.g., chemical networks
Extensions of GNNs
Attention mechanism and more general graphs
Harada et al. (2020)
Harada, S., Akita, H., Tsubaki, M., Baba, Y., Takigawa, I., Yamanishi, Y., & Kashima, H. (2020).
Dual graph convolutional neural network for predicting chemical networks. BMC bioinformatics, 21, 1-13.
23 KYOTO UNIVERSITY
◼ Traffic forecasting is an important fundamental technology
for realizing intelligent transportation systems
◼ Promising GNN application!
◼ Traffic network can be represented as a graph
◼ Traffic flows on road segments have complex relationships to each other
Application of GNN
Traffic prediction
Shirakami, R., Kitahara, T., Takeuchi, K., & Kashima, H. (2023). QTNet: Theory-based queue length prediction for
urban traffic. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining
24 KYOTO UNIVERSITY
◼ Goal: Predict speed, flow, and queue length at each time and location
◼ Physics-informed ML + GNN:
Efficiently incorporate knowledge of traffic engineering as a constraint
◼ Achieved better prediction performance, especially under severe congestion
Physics-informed GNN for traffic prediction
Incorporate known knowledge into graph ML
Known relationship among
speed, flow, and queue length
Shirakami, R., Kitahara, T., Takeuchi, K., & Kashima, H. (2023). QTNet: Theory-based queue length prediction for
urban traffic. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining
25 KYOTO UNIVERSITY
1. Over-smoothing:
◼ Using more layers makes all vertices converge to the same
representation, which degrades performance
◼ Remedies: Pruning, skip-connection, selective layer weighting, …
2. Limited representation power
Two major technical issues of GNNs
Over-smoothing and limited representation power
https://minyoungg.github.io/MIT-deeplearning-blogs/2021/12/09/oversquashing-in-gnns/
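A tiny numpy sketch isolating the over-smoothing effect; a toy 4-cycle with pure neighbor averaging and no learned weights is an illustrative simplification of a deep GNN:

```python
import numpy as np

# Repeatedly averaging every node's representation with its neighbors'
# drives all rows toward the same vector.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
A_hat = (A + np.eye(4)) / 3.0  # row-normalized adjacency with self-loops
X = np.random.default_rng(0).random((4, 2))
for _ in range(50):
    X = A_hat @ X              # 50 "layers" of pure aggregation
print(X)                       # all rows are (nearly) identical
```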
26 KYOTO UNIVERSITY
1. Over-smoothing
2. Limited representation power
◼ There exist different graphs that cannot be distinguished by the WL test or GNNs
◼ ↓ All vertices end up with the same feature after message passing
Two major technical issues of GNNs
Over-smoothing and limited representation power
(Figure: two indistinguishable graphs; in both, every vertex always has two neighbors with the same color)
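A sketch of this classic failure case, two triangles versus one hexagon: since both graphs are 2-regular, WL color refinement (and hence message passing from identical initial features) never separates them:

```python
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
hexagon = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

def wl_histogram(adj, h=3):
    labels = {v: 0 for v in adj}  # identical initial colors
    for _ in range(h):
        sig = {v: (labels[v], tuple(sorted(labels[u] for u in adj[v])))
               for v in adj}
        table = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        labels = {v: table[sig[v]] for v in adj}
    return sorted(labels.values())

print(wl_histogram(two_triangles) == wl_histogram(hexagon))  # True
```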
27 KYOTO UNIVERSITY
◼ Graph Isomorphism Network (GIN) [Xu et al., 2019] attains the
discriminative power of the WL test
◼ Adding random features to nodes further strengthens the
representation power [Sato et al., 2021]
◼ Performs well also in practice!
Making GNNs more powerful than standard GNNs
Just adding random features strengthens GNNs
Xu, K., Hu, W., Leskovec, J., & Jegelka, S. (2019). How Powerful are Graph Neural Networks?. In ICLR.
Sato, R., Yamada, M., & Kashima, H. (2021). Random features strengthen graph neural networks. In SDM.
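A minimal sketch of the random-feature trick (the dimensionality and RNG are illustrative choices); with distinct per-node identifiers, symmetric neighborhoods like the triangles/hexagon above become distinguishable with high probability:

```python
import numpy as np

def add_random_features(X, seed=0):
    # append one random identifier column to the (n, d) node feature matrix,
    # breaking the symmetry between nodes of regular graphs
    rng = np.random.default_rng(seed)
    r = rng.random((X.shape[0], 1))
    return np.concatenate([X, r], axis=1)
```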
28 KYOTO UNIVERSITY
The Age of Causal Inference
29 KYOTO UNIVERSITY
◼ Prediction for decision making:
One of the most important uses of predictive machine learning is to
support or automate decision making
◼ If we know that a certain customer is likely to buy some products,
we can simply recommend them
◼ Treatment effect prediction for better decision making:
◼ If we issue discount coupons for products,
we should issue coupons only for products with the highest effect
◼ We need to consider the causal effect of recommendations on
propensity to buy
Treatment effect prediction
Decision support based on causal effects of actions
30 KYOTO UNIVERSITY
◼ Treatment effect: Differences in outcomes with and without treatment
= Outcome of treatment 𝑌T − Outcome without treatment 𝑌C
◼ Treatment effect can be predicted if both 𝑌T and 𝑌C are predicted
Treatment effect
Quantification of strength of causal relationships
(Figure: outcome with a coupon 𝑌T vs. outcome without a coupon 𝑌C; the treatment effect 𝑌T − 𝑌C measures how much the discount coupon promoted the propensity to buy)
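A minimal “two-model” sketch of this idea, assuming scikit-learn is available (an illustrative baseline, not the representation-learning methods on the following slides): fit separate outcome models on treated and control samples, then predict the effect as the difference of their predictions. Note that this naive approach inherits any bias in how treatments were assigned, which motivates the next slides:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_effect_predictor(X, z, y):
    # X: (n, d) features, z: (n,) binary treatments, y: (n,) outcomes
    m_t = LinearRegression().fit(X[z == 1], y[z == 1])  # model of Y_T
    m_c = LinearRegression().fit(X[z == 0], y[z == 0])  # model of Y_C
    return lambda X_new: m_t.predict(X_new) - m_c.predict(X_new)  # Y_T - Y_C

rng = np.random.default_rng(0)
X = rng.random((100, 3))
z = rng.integers(0, 2, 100)
y = X @ np.array([1.0, 2.0, 0.5]) + 3.0 * z + rng.normal(0, 0.1, 100)
effect = fit_effect_predictor(X, z, y)
print(effect(X[:3]))  # close to the true constant effect of 3.0
```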
31 KYOTO UNIVERSITY
Treatment effect prediction problem
Learning models from data including biased treatments
▪ Training data {(𝐱𝑖, 𝑧𝑖, 𝑦𝑖)}, 𝑖 = 1, …, 𝑁
⚫ 𝐱: input (target of treatment)
⚫ 𝑧 ∈ {0,1}: treatment
⚫ 𝑦: outcome
▪ Goal: prediction model 𝑓: 𝒳 × 𝒵 → 𝒴
◼ Given a target and treatment, predict its outcome
▪ Challenge: Learn unbiased prediction model from biased treatment data
(Figure: training data table with columns target 𝐱, treated? 𝑧, and outcome 𝑦. “Let us give coupons to rich people!”: if potential car buyers receive coupons with higher probability, the treatment effect risks being over-estimated)
32 KYOTO UNIVERSITY
Deep learning approach to treatment effect prediction
Learning representations independent of treatments
▪ If data is biased, the prediction model will also be biased
▪ Various bias reduction methods were proposed in causal inference
▪ In deep learning, intermediate representation of target is learned to be
independent of treatments [Shalit et al., 2017]
– Use an independence measure as a regularizer in representation learning
(Figure: the target 𝐱 is encoded into a representation that predicts the outcome; an independence measure (IPM) between the representations of treated and untreated samples serves as a regularizer)
Shalit, U., Johansson, F. D., & Sontag, D. (2017). Estimating individual treatment effect:
generalization bounds and algorithms. In International Conference on Machine Learning (ICML)
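A sketch of this balancing loss; the linear MMD (difference of representation means) stands in for the IPMs used by Shalit et al., and the weight alpha is an illustrative choice:

```python
import numpy as np

def balanced_loss(phi, y_hat, y, z, alpha=1.0):
    # phi: (n, d) learned representations; z: (n,) binary treatment indicators
    pred_loss = np.mean((y_hat - y) ** 2)
    # linear MMD between treated and control representation distributions
    ipm = np.linalg.norm(phi[z == 1].mean(axis=0) - phi[z == 0].mean(axis=0))
    return pred_loss + alpha * ipm
```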
33 KYOTO UNIVERSITY
◼ Integration of graph machine learning with treatment effect prediction
◼ What are graphs in treatment effect prediction?
◼ Target inputs as graphs:
◼ Input space has graph structure
◼ E.g., Targeted marketing in SNS
◼ Challenge: Interference/spillover effects
◼ Treatments as graphs
◼ Each treatment has a graph structure
◼ E.g., Drug effect estimation
◼ Challenge: Infinite number of treatments
Treatment effect prediction + Graph ML
Treatment targets or treatments as graphs
34 KYOTO UNIVERSITY
Graph-structured treatment targets
GNN considers treatment interference on graph
▪ Treatment effect prediction in a graph-structured input space (e.g., SNS)
▪ Interference of treatments can occur between neighbors
– “My friends have coupons, but I don't...”
▪ GNN extracts features independent of treatments [Ma & Tresp, 2021]
– GNN incorporates neighborhood information
– Independence regularization acquires
treatment-independent representations
▪ Extensions to heterogeneous networks
and unknown networks
[Lin et al., 2023,2024]
Ma, Y., & Tresp, V. (2021). Causal inference under networked interference and intervention policy enhancement. In AISTATS.
Lin, X., Zhang, G., Lu, X., Bao, H., Takeuchi, K., & Kashima, H. (2023). Estimating treatment effects under heterogeneous interference. In ECML PKDD.
Lin, X., Zhang, G., Lu, X., & Kashima, H. (2024). Treatment Effect Estimation Under Unknown Interference. In PAKDD.
35 KYOTO UNIVERSITY
Graph-structured treatments
GNN extracts features of graph treatments
▪ Treatment effect prediction of graph-structured treatments (e.g., drugs)
[Harada&Kashima, 2021]
▪ Ensure independence between the target representation and the treatment
representation extracted from the graph treatment by a GNN
▪ Zero-shot treatment effect prediction: Applicable to first-time treatments
(Figure: a GNN encodes the graph-structured treatment into a treatment representation; HSIC regularization encourages independence between the target representation and the treatment representation, and both feed the outcome prediction)
Harada, S., & Kashima, H. (2021). GraphITE: Estimating individual effects of graph-structured treatments.
In International Conference on Information & Knowledge Management (CIKM)
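A sketch of the (biased) empirical HSIC estimator that such an independence regularizer can build on; driving HSIC(R, T) toward zero encourages independence between target representations R and treatment representations T. The linear kernel here is an illustrative simplification:

```python
import numpy as np

def hsic(R, T):
    # R: (n, d1) target representations, T: (n, d2) treatment representations
    n = R.shape[0]
    K, L = R @ R.T, T @ T.T              # Gram matrices (linear kernel)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2
```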
36 KYOTO UNIVERSITY
Summary
37 KYOTO UNIVERSITY
◼ A (personal and biased) look at the history of graph machine learning,
from data mining and kernel methods to graph neural networks
◼ Although the techniques vary from time to time, the ideas of focusing
on substructures and message propagation on graphs are inherited
◼ Many topics omitted: ranking/clustering, structured output prediction
… as well as historical developments in graph signal processing
◼ Graph generation is one of the most important future topics
… as well as dynamic/heterogeneous graphs, privacy/fairness/security
Graph machine learning
Graph mining, graph kernels, GNN, and causal inference