Based on "Benchmarking and Transfer Learning for Hyperparameter Optimization of Graph Neural Networks" – Dědič & Bělohlávek, ICAART 2026.
Hyperparameter optimization (HPO) is the process of finding the right settings that control how a model trains (e.g. learning rate, regularization strength, network width...) as opposed to the parameters the model adjusts automatically during training. Unlike ordinary parameters, hyperparameters can't be tuned by gradient descent: there's no gradient to follow. Instead, the whole training run is a black box, and you have to search from the outside. Each evaluation means running a full training job, which can take anywhere from seconds to hours.
HPO is hard enough for standard neural networks. It gets significantly worse for Graph Neural Networks (GNNs). GNNs expose more interacting hyperparameters than typical architectures, are trained on graphs that can be slow and memory-intensive to process, and are increasingly used in domains where squeezing out the last bit of performance matters, such as network security, molecular biology, and recommendation systems. At the same time, the GNN HPO literature is surprisingly fragmented: existing studies typically cover a handful of datasets in a narrow domain, making it difficult to draw general conclusions. This paper addresses that gap with a systematic benchmark, and goes a step further by proposing a method that reuses HPO knowledge across datasets.
Many real-world datasets are not tables of independent rows but graphs: collections of entities (nodes) connected by relationships (edges). A citation network is a graph – papers are nodes, citations are edges. So are social networks, protein interaction maps, and computer network topologies.
GNNs learn from these relational structures by passing messages between neighbouring nodes. In each layer, every node gathers feature vectors from its neighbours, combines them (via aggregation, e.g., taking the mean or the maximum), and updates its own representation. After a few such rounds, each node holds a compact embedding that reflects both its own attributes and the broader neighbourhood context. The model used in this paper is GraphSAGE, a popular and well-studied variant that samples a fixed-size neighbourhood at each layer and aggregates it.
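The message-passing step above can be sketched in a few lines. This is a deliberately minimal illustration with scalar node features and hand-picked weights; a real GraphSAGE layer operates on feature vectors with learned weight matrices, neighbourhood sampling, and a nonlinearity.

```python
# Toy sketch of one GraphSAGE-style message-passing step with mean
# aggregation. Node features are plain floats here; a real layer uses
# vectors plus learned weight matrices and a nonlinearity.

def sage_step(features, neighbours, w_self=0.5, w_neigh=0.5):
    """Update every node: combine its own feature with the mean of its
    neighbours' features. `neighbours[v]` lists the neighbours of v."""
    updated = {}
    for v, h_v in features.items():
        neigh = neighbours.get(v, [])
        if neigh:
            mean_neigh = sum(features[u] for u in neigh) / len(neigh)
        else:
            mean_neigh = 0.0
        updated[v] = w_self * h_v + w_neigh * mean_neigh
    return updated

# Triangle graph: after one step each node blends its own value with
# the average of its two neighbours.
feats = {0: 1.0, 1: 2.0, 2: 3.0}
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(sage_step(feats, adj))  # node 0 -> 0.5*1.0 + 0.5*2.5 = 1.75
```

Stacking several such steps is what lets a node's embedding absorb information from multi-hop neighbourhoods.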
The task is node classification: given the graph and the embeddings, predict a label for each node (e.g., the research field of a paper, or the type of a network host).
GNNs expose a larger-than-usual set of hyperparameters, and many of them interact in non-obvious ways. The paper benchmarks eight hyperparameters for GraphSAGE, covering both architecture and training, including the aggregation function, activation function, optimizer, and dropout rate.
Even with these relatively small per-parameter ranges, the combined search space is large and evaluating a single configuration means training a GraphSAGE model to convergence. Exhaustive search over all combinations is completely infeasible.
The HPO problem, formally. Given a labelled dataset $D$, a learning algorithm $A_\lambda$ parameterized by hyperparameters $\lambda \in \Lambda$ (the search space), and a performance metric $f$ (here, the F1-score), an HPO method is a function that at each step proposes the next configuration to evaluate:

$$\lambda_{t+1} = \mathrm{HPO}\big(\{(\lambda_i, f(\lambda_i))\}_{i=1}^{t}\big)$$

The goal is to find $\lambda^{*} = \arg\max_{\lambda \in \Lambda} f(\lambda)$ using as few evaluations as possible.
There is a spectrum of HPO strategies, ranging from naive to sophisticated. We evaluated five:
Grid Search exhaustively evaluates a pre-defined grid of hyperparameter values. Simple, reproducible, and easy to parallelise, but the grid must be fixed in advance, it scales exponentially with the number of dimensions, and it cannot adapt based on what it has learned so far.
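Grid search's exhaustive enumeration is a few lines with `itertools.product`. The parameter names and values below are illustrative, not the paper's actual grid:

```python
# Grid search sketch: enumerate the Cartesian product of per-parameter
# value lists. Parameter names and values are illustrative.
from itertools import product

grid = {
    "lr": [1e-3, 1e-2],
    "dropout": [0.0, 0.5],
    "hidden": [64, 128],
}

def grid_configs(grid):
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(grid))
print(len(configs))  # 2 * 2 * 2 = 8 configurations
```

The exponential blow-up is visible immediately: each extra hyperparameter multiplies the trial count by the number of values on its axis.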
Random Search samples configurations uniformly at random. Surprisingly competitive: it allocates more budget to truly varying dimensions than grid search does (which wastes evaluations on duplicated values along fixed axes), and it requires no prior knowledge of which values are promising.
Bayesian Optimization (BO) builds a surrogate model of the performance landscape – a cheap probabilistic approximation of the expensive objective function. After each trial, it updates the surrogate and uses an acquisition function to select the next point that best balances exploring unknown regions and exploiting regions that look promising. BO learns from its history and tends to focus evaluations where they matter.
Tree-structured Parzen Estimator (TPE) is a close relative of BO, also from the SMBO family (Sequential Model-Based Optimization). Instead of modelling the performance surface directly, TPE models the density of good configurations and the density of bad ones separately using kernel density estimators (Parzen windows), then selects configurations where the good-to-bad density ratio is highest. It's the default sampler in the popular Optuna framework.
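To make the density-ratio idea concrete, here is a deliberately simplified 1-D sketch: a fixed-bandwidth Gaussian Parzen estimator and a crude good/bad split. This is the idea, not Optuna's actual implementation.

```python
# Minimal 1-D illustration of the TPE idea: model densities of "good"
# and "bad" trial values with Parzen (Gaussian-kernel) estimators and
# propose the candidate with the highest good/bad density ratio.
import math

def parzen_density(x, samples, bw=0.1):
    if not samples:
        return 1e-12
    return sum(math.exp(-0.5 * ((x - s) / bw) ** 2) for s in samples) / (
        len(samples) * bw * math.sqrt(2 * math.pi))

def tpe_suggest(history, candidates, gamma=0.25):
    """history: list of (value, score); higher score is better."""
    ranked = sorted(history, key=lambda t: t[1], reverse=True)
    n_good = max(1, int(gamma * len(ranked)))
    good = [v for v, _ in ranked[:n_good]]
    bad = [v for v, _ in ranked[n_good:]]
    return max(candidates,
               key=lambda x: parzen_density(x, good) / parzen_density(x, bad))

# Good trials cluster near 0.3, so the next suggestion lands there.
hist = [(0.1, 0.60), (0.3, 0.90), (0.32, 0.88), (0.7, 0.55), (0.9, 0.50)]
print(tpe_suggest(hist, [i / 20 for i in range(21)]))  # -> 0.3
```

Because only density ratios matter, TPE never needs to model the performance surface itself, which is what makes it cheap and robust in practice.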
Sobol Quasi-Monte Carlo (QMC) uses a deterministic low-discrepancy sequence to cover the search space more uniformly than pseudo-random sampling. The idea is that better space coverage should mean fewer wasted evaluations. Like random search, it does not learn from previous results.
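Sobol sequences are multi-dimensional; their 1-D building block, the base-2 van der Corput sequence, shows what "low discrepancy" means. Each new point lands in the largest remaining gap of the unit interval, so every prefix of the sequence covers the space evenly, unlike i.i.d. random draws, which can cluster:

```python
# Base-2 van der Corput sequence: the 1-D building block behind Sobol
# points. Coverage of [0, 1) stays even at every prefix length.

def van_der_corput(n):
    """n-th term: reflect the binary digits of n about the radix point."""
    value, denom = 0.0, 1.0
    while n:
        n, bit = divmod(n, 2)
        denom *= 2
        value += bit / denom
    return value

print([van_der_corput(i) for i in range(1, 8)])
# [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```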
All five methods share the same outer loop:

```python
history = []
while budget_remaining():
    λ = sampler.suggest(history)           # pick the next configuration
    score = train_and_evaluate(model, λ)   # this is expensive!
    history.append((λ, score))
return best(history)
```

The difference between methods lies entirely in `sampler.suggest`, e.g. whether it takes the history into account.
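As a concrete instance of this interface, a random-search sampler takes only a few lines. The parameter ranges and the toy objective below are illustrative stand-ins for a real GraphSAGE training run:

```python
# A random-search sampler fitting the shared outer loop. The search
# space and the toy objective are illustrative, not the paper's.
import random

class RandomSampler:
    def __init__(self, space, seed=0):
        self.space = space            # name -> list of allowed values
        self.rng = random.Random(seed)

    def suggest(self, history):       # ignores the history by design
        return {k: self.rng.choice(v) for k, v in self.space.items()}

def train_and_evaluate(config):
    """Toy stand-in for an expensive training job."""
    return 1.0 - abs(config["dropout"] - 0.3) - abs(config["lr"] - 0.01)

space = {"lr": [0.001, 0.01, 0.1], "dropout": [0.0, 0.3, 0.5]}
sampler = RandomSampler(space)
history = []
for _ in range(20):                   # 20-trial budget
    cfg = sampler.suggest(history)
    history.append((cfg, train_and_evaluate(cfg)))
best_cfg, best_score = max(history, key=lambda t: t[1])
print(best_cfg, round(best_score, 3))
```

A BO or TPE sampler would keep the same `suggest(history)` signature but actually read the history before proposing.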
To get a reliable picture, we ran all five methods on nine standard graph datasets spanning a wide range of sizes and structures:
| Dataset | #Nodes | #Edges | #Classes |
|---|---|---|---|
| Cora | 2,708 | 10,556 | 7 |
| CiteSeer | 3,327 | 9,104 | 6 |
| Squirrel | 5,201 | 198,493 | 5 |
| PubMed | 19,717 | 88,648 | 3 |
| CoraFull | 19,793 | 126,842 | 70 |
| DBLP | 17,716 | 105,734 | 4 |
| Computers | 13,752 | 491,722 | 10 |
| Flickr | 89,250 | 899,756 | 7 |
| ArXiv | 169,343 | 1,166,243 | 40 |
Each method-dataset combination is run 10 independent times (to account for randomness in both training and the HPO samplers) and stopped either when no improvement is seen for M consecutive trials (patience) or when a wall-clock budget of T hours is exhausted. The reported metric is macro F1-score on a held-out test set. This systematic design – 5 methods × 9 datasets × 10 runs – is the comprehensive benchmark that was previously missing from the literature.
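The stopping rule described above can be sketched as a small wrapper around the HPO loop. The patience and budget values here are illustrative placeholders, not the paper's actual M and T:

```python
# Sketch of the stopping rule: stop after `patience` trials without
# improvement, or when the wall-clock budget runs out.
import time

def run_hpo(suggest, evaluate, patience=5, budget_seconds=3600.0):
    deadline = time.monotonic() + budget_seconds
    history, best, stale = [], float("-inf"), 0
    while stale < patience and time.monotonic() < deadline:
        cfg = suggest(history)
        score = evaluate(cfg)
        history.append((cfg, score))
        if score > best:
            best, stale = score, 0   # improvement resets the counter
        else:
            stale += 1
    return max(history, key=lambda t: t[1])

# Scores plateau after the third trial, so the run stops after
# patience more evaluations.
scores = iter([0.5, 0.6, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7])
cfg, best = run_hpo(lambda h: {}, lambda c: next(scores), patience=3)
print(best)  # 0.7
```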
The results are in Table 1 below. Bold indicates the best method per dataset; italics indicate second best.

| Dataset | Random | Grid | BO | TPE | QMC |
|---|---|---|---|---|---|
| Cora | 0.8560 | 0.8491 | **0.8747** | *0.8659* | 0.8351 |
| CiteSeer | 0.7146 | 0.7046 | *0.7172* | **0.7236** | 0.7089 |
| Squirrel | *0.3659* | 0.3605 | 0.3544 | **0.3755** | 0.3549 |
| PubMed | 0.8515 | 0.8457 | **0.8825** | *0.8643* | 0.8596 |
| CoraFull | 0.6211 | 0.6371 | *0.6450* | **0.6555** | 0.6385 |
| DBLP | 0.8051 | 0.7996 | **0.8118** | 0.8085 | *0.8088* |
| Computers | 0.6973 | 0.5999 | **0.8945** | *0.8047* | 0.7428 |
| Flickr | 0.0864 | 0.0961 | **0.1908** | *0.1460* | 0.1069 |
| ArXiv | 0.3950 | 0.3796 | *0.3987* | **0.4098** | 0.3955 |

Table 1. Final macro F1-scores averaged over 10 independent runs. Bold: best. Italic: second best.
Three patterns stand out clearly. First, SMBO methods (BO and TPE) win consistently. In seven out of nine datasets, both top spots go to BO or TPE. On the Squirrel dataset, random search sneaks into second place; on DBLP, Sobol QMC does. But the overall picture is clear: methods that learn from their history outperform methods that don't. Importantly, the advantage of BO and TPE only becomes visible after about 10 trials – there's an initial warm-up period while the surrogate model is built from scratch. Before that, all methods are roughly equivalent.
Second, random search is a surprisingly solid baseline. It consistently trails SMBO by only a small margin. The two exceptions are Computers and Flickr, where BO dominates with F1 scores more than 0.09 points higher than random – a substantial gap in practice. These datasets appear to have a more complex performance landscape that rewards smarter search.
Third, Sobol QMC underperforms. Despite its theoretical appeal for better space coverage, it never wins on any dataset and consistently ranks below SMBO. Better coverage of the search space doesn't help if you're ignoring the information each trial gives you.
BO and TPE are good, but they start from scratch on every new dataset. Every HPO run begins with no knowledge of what worked before. A human expert tuning their fifth GNN this month would do better than one tuning their first, because they have intuitions about what tends to work. Can we give a machine that same advantage?
The key observation is that graph datasets have measurable structural fingerprints. You can describe any graph dataset by a vector of numerical properties: how many nodes, how many edges, how densely connected, and crucially, how much the graph's topology aligns with the classification task. Two datasets with similar fingerprints might respond similarly to the same hyperparameter settings. If we've tuned a GNN on eight datasets before, we already have a wealth of evidence about which hyperparameter configurations perform well on which kinds of graphs.
The paper's proposal: build a meta-model that learns to predict, given a dataset's structural fingerprint and a hyperparameter configuration, what F1-score you'd get. This meta-model is cheap to evaluate – it's just a Random Forest, not a GNN training run. So you can use it to pre-rank all candidate configurations and start HPO from the most promising one, rather than starting blindly.
If past experience shows that high-dropout configurations work well on dense, high-homophily graphs, and your new dataset is also dense and high-homophily, why not start there?
Each dataset is represented by a descriptor vector capturing three categories of graph properties: basic scale (node and edge counts), connectivity structure (e.g. density), and label-related statistics, most notably homophily.
Homophily is perhaps the most important concept here. In a citation network, papers tend to cite papers in the same field – edges mostly connect nodes of the same class. That's high homophily. In some social networks or heterophilic graphs (like Squirrel, which encodes a Wikipedia page-to-page network), connected nodes often belong to different classes – that's low homophily. GNNs behave very differently in these two regimes: averaging neighbour features is helpful when neighbours share the same label, but actively harmful when they don't. Homophily is therefore highly predictive of which GNN configurations will work.
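One common formalization, edge homophily, is simply the fraction of edges whose endpoints share a label. A minimal implementation on a toy graph:

```python
# Edge homophily: the fraction of edges connecting same-label nodes.
# Cora-like citation graphs score high; Squirrel-like graphs score low.

def edge_homophily(edges, labels):
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

labels = {0: "A", 1: "A", 2: "A", 3: "B"}
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
print(edge_homophily(edges, labels))  # 2 same-label edges of 4 -> 0.5
```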
A Random Forest regressor $\hat{f}$ is trained on a meta-dataset of past runs, where each record is a tuple of (dataset properties, hyperparameter configuration, observed F1-score):

$$\big(\phi(D_i),\ \lambda_j,\ f_{ij}\big)$$

where $\phi(D_i)$ is the descriptor vector of dataset $D_i$ and $f_{ij}$ is the F1-score that configuration $\lambda_j$ achieved on it. To tune hyperparameters on a new dataset $D_{\text{new}}$, the method simply picks the configuration that the meta-model predicts will score highest:

$$\lambda_0 = \arg\max_{\lambda \in \Lambda} \hat{f}\big(\phi(D_{\text{new}}),\ \lambda\big)$$
Because evaluating a Random Forest is essentially free compared to training a GNN, you can score every candidate configuration in the search space and pick the globally best one. No expensive model training is needed for the initial ranking.
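The zero-shot step reduces to scoring every candidate and taking the argmax. In the sketch below, `meta_predict` is a hypothetical stand-in for the trained Random Forest, and the descriptor and search space are illustrative:

```python
# Sketch of Cross-RF's warm start: score all candidate configurations
# with a cheap meta-model and begin HPO from the argmax.
# `meta_predict` is a hypothetical stand-in for the trained Random
# Forest; the descriptor and search space are illustrative.
from itertools import product

def meta_predict(descriptor, config):
    # Toy rule: pretend dense graphs favour higher dropout.
    target = 0.5 if descriptor["density"] > 0.01 else 0.2
    return 1.0 - abs(config["dropout"] - target)

def warm_start(descriptor, space):
    keys = list(space)
    candidates = [dict(zip(keys, vals))
                  for vals in product(*(space[k] for k in keys))]
    return max(candidates, key=lambda c: meta_predict(descriptor, c))

space = {"dropout": [0.0, 0.2, 0.5], "lr": [0.001, 0.01]}
print(warm_start({"density": 0.02}, space))   # picks dropout=0.5
print(warm_start({"density": 0.001}, space))  # picks dropout=0.2
```

Scoring the full candidate grid this way costs milliseconds, versus minutes to hours for a single real GNN training run.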
While in practice you'd use Cross-RF on a new, unseen dataset, for the purposes of our evaluation we chose a leave-one-out style experiment. The meta-model for a given target dataset is trained on all runs from the other eight datasets. Cross-RF has never seen the target dataset during training, so its initial recommendation is genuinely a zero-shot transfer from past experience. The Random Forest uses 100 trees, with 30% of the features considered at each split. Categorical hyperparameters (activation function, optimizer, aggregation) are one-hot encoded.
Cross-RF is compared only to BO and TPE – the best methods from the benchmark. The results are in Table 2.
| Dataset | BO | TPE | Cross-RF |
|---|---|---|---|
| Cora | **0.8747** | 0.8659 | *0.8727* |
| CiteSeer | *0.7172* | **0.7236** | 0.7095 |
| Squirrel | 0.3544 | *0.3755* | **0.3769** |
| PubMed | **0.8825** | 0.8643 | *0.8807* |
| CoraFull | *0.6450* | **0.6555** | 0.6426 |
| DBLP | *0.8118* | 0.8085 | **0.8123** |
| Computers | *0.8945* | 0.8047 | **0.9023** |
| Flickr | *0.1908* | 0.1460 | **0.1956** |
| ArXiv | 0.3987 | **0.4098** | *0.4094* |

Table 2. Final macro F1-scores for BO, TPE, and Cross-RF, averaged over 10 independent runs. Bold: best. Italic: second best.
Cross-RF achieves the best final performance on 4 out of 9 datasets (Squirrel, DBLP, Computers, Flickr) and is second best on another 3 (Cora, PubMed, ArXiv). It only falls behind both reference methods on CiteSeer and CoraFull. The overall average rank is 1.78 for Cross-RF versus 2.0 for BO and 2.22 for TPE – a consistent advantage.
The largest wins are on Computers and Flickr – exactly the datasets where SMBO already had a large advantage over random search. These are the harder datasets, where the performance landscape has clear structure that rewards prior knowledge. A method that already knows where to look on such datasets, before collecting any data on them, is doing something genuinely useful. The two failure cases (CiteSeer, CoraFull) suggest the limits: transferring knowledge from eight datasets won't always match a method that adapts entirely to the dataset at hand.
If you're tuning a GNN today, a few practical conclusions follow from this work: prefer history-aware SMBO methods (BO or TPE) over grid search or Sobol sampling, keep random search in mind as a cheap and surprisingly solid baseline, and reuse knowledge from past tuning runs when you have it.
This work focused on GraphSAGE and macro F1-score across nine datasets – a solid first systematic benchmark, but there's plenty of room to grow. Immediate extensions include testing other GNN architectures (GAT, GCN, GIN) and other performance metrics (AUC-ROC, accuracy). A particularly interesting longer-term direction is pre-training the meta-model on synthetic graphs generated from known probabilistic models (Erdős-Rényi graphs, stochastic block models), which could remove the dependence on having a large collection of real benchmark datasets as training data. Another direction is incorporating Cross-RF as a surrogate model inside an SMBO loop – combining the warm-start advantage of meta-learning with the adaptive refinement of Bayesian optimization for an even tighter integration of past knowledge and current observations.