Based on "Benchmarking and Transfer Learning for Hyperparameter Optimization of Graph Neural Networks" – Dědič & Bělohlávek, ICAART 2026.
Hyperparameter optimization (HPO) is the process of finding the right settings that control how a model trains (e.g. learning rate, regularization strength, network width...) as opposed to the parameters the model adjusts automatically during training. Unlike ordinary parameters, hyperparameters can't be tuned by gradient descent: there's no gradient to follow. Instead, the whole training run is a black box, and you have to search from the outside. Each evaluation means running a full training job, which can take anywhere from seconds to hours.
HPO is hard enough for standard neural networks. It gets significantly worse for Graph Neural Networks (GNNs). GNNs expose more interacting hyperparameters than typical architectures, are trained on graphs that can be slow and memory-intensive to process, and are increasingly used in domains where squeezing out the last bit of performance matters, such as network security, molecular biology, and recommendation systems. At the same time, the GNN HPO literature is surprisingly fragmented: existing studies typically cover a handful of datasets in a narrow domain, making it difficult to draw general conclusions. This paper addresses that gap with a systematic benchmark, and goes a step further by proposing a method that reuses HPO knowledge across datasets.
Many real-world datasets are not tables of independent rows but graphs: collections of entities (nodes) connected by relationships (edges). A citation network is a graph – papers are nodes, citations are edges. So are social networks, protein interaction maps, and computer network topologies.
GNNs learn from these relational structures by passing messages between neighbouring nodes. In each layer, every node gathers feature vectors from its neighbours, combines them (via aggregation, e.g., taking the mean or the maximum), and updates its own representation. After a few such rounds, each node holds a compact embedding that reflects both its own attributes and the broader neighbourhood context. The model used in this paper is GraphSAGE, a popular and well-studied variant that samples a fixed-size neighbourhood at each layer and aggregates it.
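The message-passing step above can be sketched in a few lines. This is a deliberately minimal illustration with scalar node features and hand-picked weights; a real GraphSAGE layer operates on feature vectors with learned weight matrices, neighbourhood sampling, and a nonlinearity.

```python
# Toy sketch of one GraphSAGE-style message-passing step with mean
# aggregation. Node features are plain floats here; a real layer uses
# vectors plus learned weight matrices and a nonlinearity.

def sage_step(features, neighbours, w_self=0.5, w_neigh=0.5):
    """Update every node: combine its own feature with the mean of its
    neighbours' features. `neighbours[v]` lists the neighbours of v."""
    updated = {}
    for v, h_v in features.items():
        neigh = neighbours.get(v, [])
        if neigh:
            mean_neigh = sum(features[u] for u in neigh) / len(neigh)
        else:
            mean_neigh = 0.0
        updated[v] = w_self * h_v + w_neigh * mean_neigh
    return updated

# Triangle graph: after one step each node blends its own value with
# the average of its two neighbours.
feats = {0: 1.0, 1: 2.0, 2: 3.0}
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(sage_step(feats, adj))  # node 0 -> 0.5*1.0 + 0.5*2.5 = 1.75
```

Stacking several such steps is what lets a node's embedding absorb information from multi-hop neighbourhoods.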
The task is node classification: given the graph and the embeddings, predict a label for each node (e.g., the research field of a paper, or the type of a network host).
GNNs expose a larger-than-usual set of hyperparameters, and many of them interact in non-obvious ways. The paper benchmarks eight hyperparameters for GraphSAGE, covering both architecture and training, including the aggregation function, activation function, optimizer, and dropout rate.
Even with these relatively small per-parameter ranges, the combined search space is large and evaluating a single configuration means training a GraphSAGE model to convergence. Exhaustive search over all combinations is completely infeasible.
The HPO problem, formally. Given a labelled dataset $D$, a learning algorithm $A_\lambda$ parameterized by hyperparameters $\lambda \in \Lambda$ (the search space), and a performance metric $f$ (here, the F1-score), an HPO method is a function that at each step proposes the next configuration to evaluate:

$$\lambda_{t+1} = \mathrm{HPO}\big(\{(\lambda_i, f(\lambda_i))\}_{i=1}^{t}\big)$$

The goal is to find $\lambda^{*} = \arg\max_{\lambda \in \Lambda} f(\lambda)$ using as few evaluations as possible.
There is a spectrum of HPO strategies, ranging from naive to sophisticated. We evaluated five:
Grid Search exhaustively evaluates a pre-defined grid of hyperparameter values. Simple, reproducible, and easy to parallelise, but the grid must be fixed in advance, it scales exponentially with the number of dimensions, and it cannot adapt based on what it has learned so far.
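Grid search's exhaustive enumeration is a few lines with `itertools.product`. The parameter names and values below are illustrative, not the paper's actual grid:

```python
# Grid search sketch: enumerate the Cartesian product of per-parameter
# value lists. Parameter names and values are illustrative.
from itertools import product

grid = {
    "lr": [1e-3, 1e-2],
    "dropout": [0.0, 0.5],
    "hidden": [64, 128],
}

def grid_configs(grid):
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(grid_configs(grid))
print(len(configs))  # 2 * 2 * 2 = 8 configurations
```

The exponential blow-up is visible immediately: each extra hyperparameter multiplies the trial count by the number of values on its axis.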
Random Search samples configurations uniformly at random. Surprisingly competitive: it allocates more budget to truly varying dimensions than grid search does (which wastes evaluations on duplicated values along fixed axes), and it requires no prior knowledge of which values are promising.
Bayesian Optimization (BO) builds a surrogate model of the performance landscape – a cheap probabilistic approximation of the expensive objective function. After each trial, it updates the surrogate and uses an acquisition function to select the next point that best balances exploring unknown regions and exploiting regions that look promising. BO learns from its history and tends to focus evaluations where they matter.
Tree-structured Parzen Estimator (TPE) is a close relative of BO, also from the SMBO family (Sequential Model-Based Optimization). Instead of modelling the performance surface directly, TPE models the density of good configurations and the density of bad ones separately using kernel density estimators (Parzen windows), then selects configurations where the good-to-bad density ratio is highest. It's the default sampler in the popular Optuna framework.
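To make the density-ratio idea concrete, here is a deliberately simplified 1-D sketch: a fixed-bandwidth Gaussian Parzen estimator and a crude good/bad split. This is the idea, not Optuna's actual implementation.

```python
# Minimal 1-D illustration of the TPE idea: model densities of "good"
# and "bad" trial values with Parzen (Gaussian-kernel) estimators and
# propose the candidate with the highest good/bad density ratio.
import math

def parzen_density(x, samples, bw=0.1):
    if not samples:
        return 1e-12
    return sum(math.exp(-0.5 * ((x - s) / bw) ** 2) for s in samples) / (
        len(samples) * bw * math.sqrt(2 * math.pi))

def tpe_suggest(history, candidates, gamma=0.25):
    """history: list of (value, score); higher score is better."""
    ranked = sorted(history, key=lambda t: t[1], reverse=True)
    n_good = max(1, int(gamma * len(ranked)))
    good = [v for v, _ in ranked[:n_good]]
    bad = [v for v, _ in ranked[n_good:]]
    return max(candidates,
               key=lambda x: parzen_density(x, good) / parzen_density(x, bad))

# Good trials cluster near 0.3, so the next suggestion lands there.
hist = [(0.1, 0.60), (0.3, 0.90), (0.32, 0.88), (0.7, 0.55), (0.9, 0.50)]
print(tpe_suggest(hist, [i / 20 for i in range(21)]))  # -> 0.3
```

Because only density ratios matter, TPE never needs to model the performance surface itself, which is what makes it cheap and robust in practice.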
Sobol Quasi-Monte Carlo (QMC) uses a deterministic low-discrepancy sequence to cover the search space more uniformly than pseudo-random sampling. The idea is that better space coverage should mean fewer wasted evaluations. Like random search, it does not learn from previous results.
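Sobol sequences are multi-dimensional; their 1-D building block, the base-2 van der Corput sequence, shows what "low discrepancy" means. Each new point lands in the largest remaining gap of the unit interval, so every prefix of the sequence covers the space evenly, unlike i.i.d. random draws, which can cluster:

```python
# Base-2 van der Corput sequence: the 1-D building block behind Sobol
# points. Coverage of [0, 1) stays even at every prefix length.

def van_der_corput(n):
    """n-th term: reflect the binary digits of n about the radix point."""
    value, denom = 0.0, 1.0
    while n:
        n, bit = divmod(n, 2)
        denom *= 2
        value += bit / denom
    return value

print([van_der_corput(i) for i in range(1, 8)])
# [0.5, 0.25, 0.75, 0.125, 0.625, 0.375, 0.875]
```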
All five methods share the same outer loop:

```python
history = []
while budget_remaining():
    λ = sampler.suggest(history)           # pick the next configuration
    score = train_and_evaluate(model, λ)   # this is expensive!
    history.append((λ, score))
return best(history)
```

The difference between methods lies entirely in `sampler.suggest`, e.g. whether it takes the history into account.
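As a concrete instance of this interface, a random-search sampler takes only a few lines. The parameter ranges and the toy objective below are illustrative stand-ins for a real GraphSAGE training run:

```python
# A random-search sampler fitting the shared outer loop. The search
# space and the toy objective are illustrative, not the paper's.
import random

class RandomSampler:
    def __init__(self, space, seed=0):
        self.space = space            # name -> list of allowed values
        self.rng = random.Random(seed)

    def suggest(self, history):       # ignores the history by design
        return {k: self.rng.choice(v) for k, v in self.space.items()}

def train_and_evaluate(config):
    """Toy stand-in for an expensive training job."""
    return 1.0 - abs(config["dropout"] - 0.3) - abs(config["lr"] - 0.01)

space = {"lr": [0.001, 0.01, 0.1], "dropout": [0.0, 0.3, 0.5]}
sampler = RandomSampler(space)
history = []
for _ in range(20):                   # 20-trial budget
    cfg = sampler.suggest(history)
    history.append((cfg, train_and_evaluate(cfg)))
best_cfg, best_score = max(history, key=lambda t: t[1])
print(best_cfg, round(best_score, 3))
```

A BO or TPE sampler would keep the same `suggest(history)` signature but actually read the history before proposing.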
To get a reliable picture, we ran all five methods on nine standard graph datasets spanning a wide range of sizes and structures:
| Dataset | #Nodes | #Edges | #Classes |
|---|---|---|---|
| Cora | 2,708 | 10,556 | 7 |
| CiteSeer | 3,327 | 9,104 | 6 |
| Squirrel | 5,201 | 198,493 | 5 |
| PubMed | 19,717 | 88,648 | 3 |
| CoraFull | 19,793 | 126,842 | 70 |
| DBLP | 17,716 | 105,734 | 4 |
| Computers | 13,752 | 491,722 | 10 |
| Flickr | 89,250 | 899,756 | 7 |
| ArXiv | 169,343 | 1,166,243 | 40 |
Each method-dataset combination is run 10 independent times (to account for randomness in both training and the HPO samplers) and stopped either when no improvement is seen for M consecutive trials (patience) or when a wall-clock budget of T hours is exhausted. The reported metric is macro F1-score on a held-out test set. This systematic design – 5 methods × 9 datasets × 10 runs – is the comprehensive benchmark that was previously missing from the literature.
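The stopping rule described above can be sketched as a small wrapper around the HPO loop. The patience and budget values here are illustrative placeholders, not the paper's actual M and T:

```python
# Sketch of the stopping rule: stop after `patience` trials without
# improvement, or when the wall-clock budget runs out.
import time

def run_hpo(suggest, evaluate, patience=5, budget_seconds=3600.0):
    deadline = time.monotonic() + budget_seconds
    history, best, stale = [], float("-inf"), 0
    while stale < patience and time.monotonic() < deadline:
        cfg = suggest(history)
        score = evaluate(cfg)
        history.append((cfg, score))
        if score > best:
            best, stale = score, 0   # improvement resets the counter
        else:
            stale += 1
    return max(history, key=lambda t: t[1])

# Scores plateau after the third trial, so the run stops after
# patience more evaluations.
scores = iter([0.5, 0.6, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7])
cfg, best = run_hpo(lambda h: {}, lambda c: next(scores), patience=3)
print(best)  # 0.7
```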
The results are in Table 1 below. Bold indicates the best method per dataset; italics indicate second best.

| Dataset | Random | Grid | BO | TPE | QMC |
|---|---|---|---|---|---|
| Cora | 0.8560 | 0.8491 | **0.8747** | *0.8659* | 0.8351 |
| CiteSeer | 0.7146 | 0.7046 | *0.7172* | **0.7236** | 0.7089 |
| Squirrel | *0.3659* | 0.3605 | 0.3544 | **0.3755** | 0.3549 |
| PubMed | 0.8515 | 0.8457 | **0.8825** | *0.8643* | 0.8596 |
| CoraFull | 0.6211 | 0.6371 | *0.6450* | **0.6555** | 0.6385 |
| DBLP | 0.8051 | 0.7996 | **0.8118** | 0.8085 | *0.8088* |
| Computers | 0.6973 | 0.5999 | **0.8945** | *0.8047* | 0.7428 |
| Flickr | 0.0864 | 0.0961 | **0.1908** | *0.1460* | 0.1069 |
| ArXiv | 0.3950 | 0.3796 | *0.3987* | **0.4098** | 0.3955 |

Table 1. Final macro F1-scores averaged over 10 independent runs. Bold: best. Italic: second best.
Three patterns stand out clearly. First, SMBO methods (BO and TPE) win consistently. In seven out of nine datasets, both top spots go to BO or TPE. On the Squirrel dataset, random search sneaks into second place; on DBLP, Sobol QMC does. But the overall picture is clear: methods that learn from their history outperform methods that don't. Importantly, the advantage of BO and TPE only becomes visible after about 10 trials – there's an initial warm-up period while the surrogate model is built from scratch. Before that, all methods are roughly equivalent.
Second, random search is a surprisingly solid baseline. It consistently trails SMBO by only a small margin. The two exceptions are Computers and Flickr, where BO dominates with F1 scores more than 0.09 points higher than random – a substantial gap in practice. These datasets appear to have a more complex performance landscape that rewards smarter search.
Third, Sobol QMC underperforms. Despite its theoretical appeal for better space coverage, it never wins on any dataset and consistently ranks below SMBO. Better coverage of the search space doesn't help if you're ignoring the information each trial gives you.
BO and TPE are good, but they start from scratch on every new dataset. Every HPO run begins with no knowledge of what worked before. A human expert tuning their fifth GNN this month would do better than one tuning their first, because they have intuitions about what tends to work. Can we give a machine that same advantage?
The key observation is that graph datasets have measurable structural fingerprints. You can describe any graph dataset by a vector of numerical properties: how many nodes, how many edges, how densely connected, and crucially, how much the graph's topology aligns with the classification task. Two datasets with similar fingerprints might respond similarly to the same hyperparameter settings. If we've tuned a GNN on eight datasets before, we already have a wealth of evidence about which hyperparameter configurations perform well on which kinds of graphs.
The paper's proposal: build a meta-model that learns to predict, given a dataset's structural fingerprint and a hyperparameter configuration, what F1-score you'd get. This meta-model is cheap to evaluate – it's just a Random Forest, not a GNN training run. So you can use it to pre-rank all candidate configurations and start HPO from the most promising one, rather than starting blindly.
If past experience shows that high-dropout configurations work well on dense, high-homophily graphs, and your new dataset is also dense and high-homophily, why not start there?
Each dataset is represented by a descriptor vector capturing three categories of graph properties: basic scale (node and edge counts), connectivity structure (e.g. density), and label-related statistics, most notably homophily.
Homophily is perhaps the most important concept here. In a citation network, papers tend to cite papers in the same field – edges mostly connect nodes of the same class. That's high homophily. In some social networks or heterophilic graphs (like Squirrel, which encodes a Wikipedia page-to-page network), connected nodes often belong to different classes – that's low homophily. GNNs behave very differently in these two regimes: averaging neighbour features is helpful when neighbours share the same label, but actively harmful when they don't. Homophily is therefore highly predictive of which GNN configurations will work.
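One common formalization, edge homophily, is simply the fraction of edges whose endpoints share a label. A minimal implementation on a toy graph:

```python
# Edge homophily: the fraction of edges connecting same-label nodes.
# Cora-like citation graphs score high; Squirrel-like graphs score low.

def edge_homophily(edges, labels):
    same = sum(1 for u, v in edges if labels[u] == labels[v])
    return same / len(edges)

labels = {0: "A", 1: "A", 2: "A", 3: "B"}
edges = [(0, 1), (1, 2), (2, 3), (0, 3)]
print(edge_homophily(edges, labels))  # 2 same-label edges of 4 -> 0.5
```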
A Random Forest regressor $\hat{f}$ is trained on a meta-dataset of past runs, where each record is a tuple of (dataset properties, hyperparameter configuration, observed F1-score):

$$\big(\phi(D_i),\ \lambda_j,\ f_{ij}\big)$$

where $\phi(D_i)$ is the descriptor vector of dataset $D_i$ and $f_{ij}$ is the F1-score that configuration $\lambda_j$ achieved on it. To tune hyperparameters on a new dataset $D_{\text{new}}$, the method simply picks the configuration that the meta-model predicts will score highest:

$$\lambda_0 = \arg\max_{\lambda \in \Lambda} \hat{f}\big(\phi(D_{\text{new}}),\ \lambda\big)$$
Because evaluating a Random Forest is essentially free compared to training a GNN, you can score every candidate configuration in the search space and pick the globally best one. No expensive model training is needed for the initial ranking.
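The zero-shot step reduces to scoring every candidate and taking the argmax. In the sketch below, `meta_predict` is a hypothetical stand-in for the trained Random Forest, and the descriptor and search space are illustrative:

```python
# Sketch of Cross-RF's warm start: score all candidate configurations
# with a cheap meta-model and begin HPO from the argmax.
# `meta_predict` is a hypothetical stand-in for the trained Random
# Forest; the descriptor and search space are illustrative.
from itertools import product

def meta_predict(descriptor, config):
    # Toy rule: pretend dense graphs favour higher dropout.
    target = 0.5 if descriptor["density"] > 0.01 else 0.2
    return 1.0 - abs(config["dropout"] - target)

def warm_start(descriptor, space):
    keys = list(space)
    candidates = [dict(zip(keys, vals))
                  for vals in product(*(space[k] for k in keys))]
    return max(candidates, key=lambda c: meta_predict(descriptor, c))

space = {"dropout": [0.0, 0.2, 0.5], "lr": [0.001, 0.01]}
print(warm_start({"density": 0.02}, space))   # picks dropout=0.5
print(warm_start({"density": 0.001}, space))  # picks dropout=0.2
```

Scoring the full candidate grid this way costs milliseconds, versus minutes to hours for a single real GNN training run.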
While in practice you'd use Cross-RF on a new, unseen dataset, for the purposes of our evaluation we chose a leave-one-out style experiment. The meta-model for a given target dataset is trained on all runs from the other eight datasets. Cross-RF has never seen the target dataset during training, so its initial recommendation is genuinely a zero-shot transfer from past experience. The Random Forest uses 100 trees, with 30% of the features considered at each split. Categorical hyperparameters (activation function, optimizer, aggregation) are one-hot encoded.
Cross-RF is compared only to BO and TPE – the best methods from the benchmark. The results are in Table 2.
| Dataset | BO | TPE | Cross-RF |
|---|---|---|---|
| Cora | **0.8747** | 0.8659 | *0.8727* |
| CiteSeer | *0.7172* | **0.7236** | 0.7095 |
| Squirrel | 0.3544 | *0.3755* | **0.3769** |
| PubMed | **0.8825** | 0.8643 | *0.8807* |
| CoraFull | *0.6450* | **0.6555** | 0.6426 |
| DBLP | *0.8118* | 0.8085 | **0.8123** |
| Computers | *0.8945* | 0.8047 | **0.9023** |
| Flickr | *0.1908* | 0.1460 | **0.1956** |
| ArXiv | 0.3987 | **0.4098** | *0.4094* |

Table 2. Final macro F1-scores for BO, TPE, and Cross-RF, averaged over 10 independent runs. Bold: best. Italic: second best.
Cross-RF achieves the best final performance on 4 out of 9 datasets (Squirrel, DBLP, Computers, Flickr) and is second best on another 3 (Cora, PubMed, ArXiv). It only falls behind both reference methods on CiteSeer and CoraFull. The overall average rank is 1.78 for Cross-RF versus 2.0 for BO and 2.22 for TPE – a consistent advantage.
The largest wins are on Computers and Flickr – exactly the datasets where SMBO already had a large advantage over random search. These are the harder datasets, where the performance landscape has clear structure that rewards prior knowledge. A method that already knows where to look on such datasets, before collecting any data on them, is doing something genuinely useful. The two failure cases (CiteSeer, CoraFull) suggest the limits: transferring knowledge from eight datasets won't always match a method that adapts entirely to the dataset at hand.
If you're tuning a GNN today, a few practical conclusions follow from this work: prefer history-aware SMBO methods (BO or TPE) over grid search or Sobol sampling, keep random search in mind as a cheap and surprisingly solid baseline, and reuse knowledge from past tuning runs when you have it.
This work focused on GraphSAGE and macro F1-score across nine datasets – a solid first systematic benchmark, but there's plenty of room to grow. Immediate extensions include testing other GNN architectures (GAT, GCN, GIN) and other performance metrics (AUC-ROC, accuracy). A particularly interesting longer-term direction is pre-training the meta-model on synthetic graphs generated from known probabilistic models (Erdős-Rényi graphs, stochastic block models), which could remove the dependence on having a large collection of real benchmark datasets as training data. Another direction is incorporating Cross-RF as a surrogate model inside an SMBO loop – combining the warm-start advantage of meta-learning with the adaptive refinement of Bayesian optimization for an even tighter integration of past knowledge and current observations.