Distillation of GLB 2021

Jiaqi Ma

Earlier this year, we held the first Workshop on Graph Learning Benchmarks (GLB 2021) at The Web Conference 2021, co-organized by Dr. Yuxiao Dong, Prof. Danai Koutra, Prof. Qiaozhu Mei, Jiong Zhu, and myself. The workshop hosted an amazing lineup of invited speakers, including Prof. Jure Leskovec (Stanford) and Prof. Leman Akoglu (CMU), as well as panelists, including Prof. Stephan Günnemann (TU Munich), Prof. Yizhou Sun (UCLA), and Prof. Jie Tang (Tsinghua).

I took a lot of inspiration from the workshop, especially from the fantastic panel discussion, and it is a pity that the event was not open to the public. So I have turned my notes into this blog post to share my key takeaways. For ease of reading, I avoid direct quotations from the panelists and presenters, but much of the content below should be credited to the intellectual discussion among the panelists, the invited speakers, and the workshop paper presenters, whom I cannot exhaustively list. I also thank Qiaozhu, Danai, and Yuxiao for their constructive feedback on this blog post; all errors are my own.

Graph Learning Tasks Are Diverse

Graph-structured data are ubiquitous: social networks, road networks, chemical compounds, and power grids, to name a few. At the same time, they are both heterogeneous and diverse. As such, it is unlikely that we can find one single technique that excels at machine learning tasks on all types of graph-structured data. Indeed, while recent literature has demonstrated that graph machine learning models, especially graph neural networks (GNNs), achieve promising performance on many machine learning tasks, there is plenty of evidence that further improvements on particular tasks often face dramatically different technical bottlenecks.

For example, there is a clear distinction between node-level prediction tasks and graph-level prediction tasks. On one hand, the expressive power of GNNs is mainly a concern for graph-level prediction tasks, and the recently developed higher-order GNNs with better expressive power are almost entirely evaluated on graph-level prediction tasks [1]–[3]. On the other hand, techniques that help GNNs alleviate the over-smoothing problem [4] are mainly tested on node-level prediction tasks [5], [6].

A higher-order GNN [1] with improved expressive power. Such methods are almost entirely evaluated on graph-level prediction tasks, suggesting a fundamental difference between graph-level tasks and node-level tasks.
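
As a side note, the over-smoothing phenomenon itself is easy to reproduce numerically. Below is a minimal sketch (my own illustration, not from the workshop): repeatedly averaging each node's features with its neighbors' makes the node representations collapse toward a constant vector.

```python
# A minimal numerical illustration of over-smoothing [4]: as we repeatedly
# average each node's features with its neighbors', node representations
# collapse toward a constant and become indistinguishable.
import numpy as np
import networkx as nx

g = nx.barabasi_albert_graph(100, m=2, seed=0)
A = nx.to_numpy_array(g) + np.eye(100)     # adjacency with self-loops
P = A / A.sum(axis=1, keepdims=True)       # row-normalized propagation matrix
X = np.random.default_rng(0).normal(size=(100, 16))

for k in (0, 1, 2, 4, 8, 16, 32):
    Xk = np.linalg.matrix_power(P, k) @ X
    spread = np.linalg.norm(Xk - Xk.mean(axis=0), axis=1).mean()
    print(f"{k:2d} propagation steps: avg distance to mean = {spread:.3f}")
```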

Understanding the diversity of graph learning tasks is critical for our community. Researchers must understand how the challenges differ across tasks in order to develop suitable graph learning techniques. Practitioners need clear guidance on how to choose, from the rapidly growing toolbox of graph learning techniques, the right tool for the task at hand.

To achieve these goals, we need to build a taxonomy of graph learning tasks that is succinct, but sufficiently informative to capture the distinctions among different types of tasks.

A More Comprehensive Task Taxonomy, Please!

In most existing literature [7]–[9], learning tasks on graphs are roughly categorized, in terms of the abstract formats of their inputs and outputs, as node-level tasks, edge-level tasks, subgraph-level tasks, and graph-level tasks. However, this data-format-based taxonomy is too coarse to meaningfully reflect the diverse graph learning tasks in the real world.

Computer Vision (CV) concepts listed in the ACM Computing Classification System. Not all of these concepts are CV tasks, and not all CV tasks are included here, but we can already see diverse branches of CV tasks. In comparison, our knowledge of the graph learning task taxonomy is rather limited.

First, tasks that share the same data format can be very different. For example, in the area of Natural Language Processing (NLP), both Sentiment Analysis and Natural Language Generation can be formalized as classification given a sequence of words (the latter as repeated next-word classification), yet each task poses unique technical challenges. Similarly, many reasoning tasks on graphs (e.g., finding the shortest path or summarizing pairwise relationships) can be formulated as graph-level prediction tasks. However, the difference in the algorithmic nature of these tasks requires different GNN architectures to generalize well [10]. Being able to clearly distinguish these tasks is a prerequisite for understanding and overcoming their unique challenges.
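
To make the point concrete, here is a minimal sketch (my own illustration; the task design is only in the spirit of [10]) of two synthetic tasks that share the exact same format, a graph in and a scalar out, yet differ fundamentally in algorithmic nature: one requires dynamic-programming-style reasoning, the other local counting.

```python
# Two synthetic "graph-level prediction" tasks with identical input/output
# formats (a graph in, a scalar out) but very different algorithmic natures.
import networkx as nx
import random

def make_instance(n=12, p=0.3, seed=None):
    rng = random.Random(seed)
    g = nx.gnp_random_graph(n, p, seed=rng.randint(0, 10**9))
    # Keep only connected graphs so the shortest path is well defined.
    while not nx.is_connected(g):
        g = nx.gnp_random_graph(n, p, seed=rng.randint(0, 10**9))
    s, t = rng.sample(list(g.nodes), 2)
    # Task A: a dynamic-programming-style reasoning target.
    y_shortest_path = nx.shortest_path_length(g, s, t)
    # Task B: a local-counting target over the same graph.
    y_triangles = sum(nx.triangles(g).values()) // 3
    return g, (s, t), y_shortest_path, y_triangles

g, (s, t), y_sp, y_tri = make_instance(seed=0)
print(f"shortest path {s}->{t}: {y_sp}, triangle count: {y_tri}")
```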

In addition, real-world graphs have complex structures. For example, graphs may be heterogeneous [11] or dynamic [12]; edges may be directed, undirected, weighted, or signed [13]; connections between nodes may exhibit homophily [14] or follow preferential attachment [15]. The simplistic taxonomy above does not mirror this variety of complex structures in the real world.

The lack of a fine-grained taxonomy of graph learning tasks also influences the current practice of constructing benchmark datasets. For example, the early node-level benchmark datasets (such as Cora, Citeseer, and Pubmed [16], [17]) are predominantly homophilous, which in turn has driven the development of major GNN models that “overfit” homophilous graphs [18]. Moreover, given the diversity of graph-structured data, there are undoubtedly plenty of interesting graph learning tasks for which we do not, as yet, have (modern) benchmark datasets.
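
One way to quantify this bias is the edge homophily ratio used in [18]: the fraction of edges that connect two nodes of the same class. Here is a minimal sketch (the toy graph and labels are made up for illustration):

```python
# Edge homophily ratio, as in [18]: the fraction of edges connecting two
# same-label nodes. Values near 1 indicate a homophilous graph (like Cora,
# Citeseer, or Pubmed); values near 0 indicate heterophily.
import networkx as nx

def edge_homophily(g: nx.Graph, labels: dict) -> float:
    same = sum(1 for u, v in g.edges if labels[u] == labels[v])
    return same / g.number_of_edges()

# Toy example: two label-aligned triangles joined by a single cross edge.
g = nx.Graph()
g.add_edges_from([(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)])
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
print(edge_homophily(g, labels))  # 6/7 ~ 0.86 -> strongly homophilous
```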

Overall, while the existing simple taxonomy can to some extent help us identify distinctions among graph learning tasks (e.g., over-smoothing is mostly associated with node-level tasks), it ignores the semantics and complex structures of graph-structured data. It was a good starting point for kicking off research on graph learning, but moving forward, we need a more comprehensive taxonomy to further enhance our understanding of the diversity of graph learning tasks.

Having the Task Taxonomy Co-evolve with Benchmark Datasets

The above discussion seems to suggest that we should first construct a fine-grained taxonomy of graph learning tasks, and then establish benchmark datasets according to the taxonomy. In reality, however, even the best experts in our community may not have a clear picture of the ideal task taxonomy.

Knowledge does not arise in a vacuum. A practical approach is to have the task taxonomy co-evolve with the collection of benchmark datasets. On one hand, by broadly soliciting a variety of graph learning datasets, we can gradually identify new branches of tasks whenever we find datasets on which existing graph learning methods do not work well. On the other hand, the development of a more comprehensive taxonomy will guide the practice of benchmark evaluation.

The task taxonomy is knowledge built on top of graph datasets: we need to collect the data to develop the knowledge, and the knowledge of the task taxonomy can in turn guide the development of new benchmark datasets. Image source (licensed under CC BY 4.0).

This bottom-up approach to building the taxonomy of graph learning tasks can only be made possible through a collective effort of the community. While such a co-evolution process may naturally happen in our field even without intervention, we hope the GLB workshop can help accelerate it. In particular, we envision possible contributions from the following two perspectives.

Gather ideas on how to solicit graph learning datasets.

We hope the GLB workshop can serve as a platform to brainstorm ideas for soliciting new graph learning datasets. The panelists and speakers at GLB 2021 suggested two interesting directions.

  1. Create synthetic data. There is already a rich literature in graph theory and network science studying various characteristics of graphs with clear mathematical formulations, such as degree distributions, motif distributions, and community structures. More importantly, such characteristics have been shown to be relevant to real-world phenomena on graphs [19]. Synthetic data capturing these characteristics can help us better understand the limitations of existing methods, since we can cleanly control individual graph properties, unlike in real data, where multiple effects interact with each other (see the sketch after this list).

  2. Closely collaborate with experts from other domains where there are more practically meaningful graph learning tasks. The space of all possible graph learning tasks may be too large to effectively explore. With limited research resources, the development of new graph learning techniques should be driven by practically meaningful tasks/applications. The natural science domains, such as biology, chemistry, and material science, are promising directions to pursue.
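
As promised in direction 1 above, here is a minimal sketch of the synthetic-data idea using off-the-shelf generators from networkx (my own illustration; the generators and parameters are arbitrary choices): preferential attachment [15] gives clean control over the degree distribution, and a stochastic block model gives clean control over community structure.

```python
# Classic generators from network science let us control one structural
# property at a time, unlike real graphs where many effects interact.
import networkx as nx

n = 1000
# Preferential attachment [15]: yields a heavy-tailed degree distribution.
g_ba = nx.barabasi_albert_graph(n, m=3, seed=0)

# Stochastic block model: community structure controlled by the block matrix.
sizes = [500, 500]
probs = [[0.05, 0.005],
         [0.005, 0.05]]  # dense within blocks, sparse across -> 2 communities
g_sbm = nx.stochastic_block_model(sizes, probs, seed=0)

deg_ba = sorted((d for _, d in g_ba.degree()), reverse=True)
print("BA top-5 degrees (heavy tail):", deg_ba[:5])
print("SBM cross-block edges:",
      sum(1 for u, v in g_sbm.edges
          if g_sbm.nodes[u]["block"] != g_sbm.nodes[v]["block"]))
```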

Incentivize the effort of creating datasets.

Creating a high-quality dataset takes a huge effort. For example, Prof. Jie Tang mentioned during the panel discussion that creating the Open Academic Graph (OAG) dataset (https://www.microsoft.com/en-us/research/project/open-academic-graph/) was an enormous commitment, even though OAG is based on two existing datasets, MAG [20] and AMiner [21]. Personally, however, I think the effort of dataset creation has not been sufficiently appreciated in the graph learning community.

First, some datasets are not credited appropriately. For example, the popular benchmark datasets for GNNs, Cora and Citeseer, should at least be credited to both [16] and [17]. However, appropriate citations are frequently missing in the literature (I confess that I made this mistake in one of my own papers).

Second, in contrast to other areas of machine learning such as NLP, there was no dedicated publication track for graph learning datasets prior to the GLB workshop. (Dataset work is starting to gain recognition in the broader machine learning community, e.g., through the new Datasets and Benchmarks track at NeurIPS 2021, which did not yet exist when we held GLB 2021. However, most data mining conferences, which are important publication venues for graph learning, do not have similar tracks yet.) This raises the barrier to publishing work focused on dataset creation.

To incentivize the important effort of dataset creation in the graph learning community, we should remind our colleagues that properly crediting such work is good scholarship, and we should make work that brings in interesting new datasets easier to publish.

The “Null” Category: Where the Graph is Not Useful

As members of the graph learning community, we are all excited about the numerous successful applications of graph learning methods emerging in this field. However, it is also worth paying attention to the negative results that appear in our practice. Having graph structures available in a dataset does not necessarily mean that the graphs will be helpful for every machine learning task on that dataset, and graph-based learning methods are not always better than graph-agnostic ones.

During the panel discussion, the panelists mentioned quite a few scenarios where graph structures are not very helpful, and they discussed some heuristics for judging whether the graph is useful in a learning task.

First, the semantics of the data can provide some insight. For example, we may intuitively know that a road network or power grid plays a key role in many prediction tasks defined on top of it, while in cases such as image classification on Flickr, the Flickr network may be secondary compared to the image content itself. Second, the source of the graph structure matters. If the graph is constructed as a k-nearest-neighbor graph over node feature similarity, it may not provide much added value once we have properly utilized the node features. Third, for some tasks involving social networks, the network structure is often most helpful in cold-start scenarios where little information about the user is available.
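
These heuristics suggest a cheap empirical sanity check before reaching for a full GNN: does the same graph-agnostic model improve when its input features are smoothed over the graph? Below is a minimal sketch on synthetic data (my own illustration, not a procedure from the panel; the propagation step is in the spirit of simplified graph convolutions):

```python
# Compare a graph-agnostic baseline against the same model on graph-smoothed
# features. If smoothing over the graph does not beat raw features, that is
# weak evidence the graph adds little value for this task.
import numpy as np
import networkx as nx
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic homophilous graph: 2 communities, features weakly tied to labels.
g = nx.stochastic_block_model([200, 200], [[0.05, 0.005], [0.005, 0.05]], seed=0)
y = np.array([g.nodes[u]["block"] for u in g.nodes])
rng = np.random.default_rng(0)
X = y[:, None] * 0.5 + rng.normal(size=(400, 8))  # weak feature signal

# Smoothing step: X_smooth = (D^-1 A)^2 X with self-loops added.
A = nx.to_numpy_array(g) + np.eye(400)
P = A / A.sum(axis=1, keepdims=True)
X_smooth = np.linalg.matrix_power(P, 2) @ X

for name, feats in [("raw features", X), ("graph-smoothed", X_smooth)]:
    Xtr, Xte, ytr, yte = train_test_split(feats, y, test_size=0.5, random_state=0)
    acc = LogisticRegression(max_iter=1000).fit(Xtr, ytr).score(Xte, yte)
    print(f"{name}: test accuracy {acc:.2f}")
```

If the smoothed variant does not beat the raw features on a real dataset, the graph may well belong in the “null” category discussed below for that particular task.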

Given all the discussion above, it seems we should have a “null” category in the taxonomy of graph learning tasks: one consisting of tasks where the graph is not useful even though it appears in the data. The GLB workshop also welcomes submissions of negative results, which can help us enhance our understanding of this category. The validity of negative results needs to be critically examined and reviewed, as negative results sometimes stem from poor method or experiment design. But I believe that establishing a culture where people can publish carefully analyzed negative results would benefit this field.

Measuring the Progress: More than Prediction Accuracy

In addition to the datasets, an equally important aspect of benchmark studies is the choice of evaluation metrics. This brings us to the question of how to measure progress in graph learning research.

There is no doubt that prediction accuracy should not be the only metric we look at. This point has become evident in machine learning in general, as reflected by recently established conferences such as FAccT and AIES.

Due to the complex structure of graph-structured data, issues such as robustness, fairness, and interpretability are particularly complicated in graph domains.

For node-level tasks, there is structural disparity among the data samples (nodes). Nodes in important structural positions (e.g., with high centrality) may have greater influence on the training of graph-based models, which has unique implications for the robustness and fairness of graph learning. In terms of robustness, adversarial attacks on graph learning models can be carried out not only by perturbing the features of each sample, but also by modifying the structural relationships between samples [22]. Moreover, manipulating a few critical data samples, compared to randomly selected ones, leads to significantly more severe overall attack consequences [23], [24]. In terms of fairness, there is a theoretically predictable accuracy disparity in graph learning models associated with the structural positions of different nodes [25]. Learning under heterophily is also found to be particularly difficult for low-degree nodes [18].

Nodes in a graph span different structural positions. Such structural disparity among nodes has various implications for graph learning models. For example, in adversarial attacks, manipulating a few critical nodes, compared to randomly selected ones, leads to significantly more severe overall attack consequences [23], [24].
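
As a toy illustration of why target selection matters (my own sketch; the actual attacks in [23], [24] are far more involved), compare how much of a 2-layer GNN's receptive field an attacker can reach with a fixed budget of victim nodes, chosen randomly versus by degree:

```python
# Proxy for attack influence: how many nodes sit within 2 hops of the chosen
# targets, i.e., how many 2-layer-GNN receptive fields a perturbation reaches.
import networkx as nx
import random

g = nx.barabasi_albert_graph(500, m=3, seed=0)
budget = 10

random.seed(0)
random_targets = random.sample(list(g.nodes), budget)
central_targets = sorted(g.nodes, key=lambda u: g.degree[u], reverse=True)[:budget]

def coverage(targets, hops=2):
    reached = set()
    for t in targets:
        reached |= set(nx.single_source_shortest_path_length(g, t, cutoff=hops))
    return len(reached)

print("2-hop coverage, random targets:     ", coverage(random_targets))
print("2-hop coverage, high-degree targets:", coverage(central_targets))
```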

For graph-level tasks, propagation-based graph representations lead to a sparsification of data points in the graph-level representation space, which causes conventional evaluation practices for outlier detection to break down when applied to graph classification problems [26].

Perhaps more critically, we lack principled theoretical foundations for the proper analysis of graph learning methods, often due to the non-IID nature of graph-structured data. For example, how to properly perform significance testing [27] and cross-validation [28] on graph-structured data remains an open research problem. Adapting statistical learning theory to non-IID graph learning scenarios is also challenging [25], [29].
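
As a concrete illustration of the cross-validation issue, here is a toy sketch loosely inspired by the edge-sampling idea of [28] (emphatically not their actual procedure): hold out a random subset of node pairs rather than splitting nodes IID, fit a low-rank approximation to the observed adjacency matrix, and use the held-out pairs to select the model rank.

```python
# Toy "network cross-validation": hold out node pairs, fit on the rest,
# and score held-out pairs. Held-out pairs are crudely imputed as 0 here.
import numpy as np
import networkx as nx
from sklearn.metrics import roc_auc_score

# Synthetic 2-block graph; pairs inside a block connect more often.
g = nx.stochastic_block_model([100, 100], [[0.1, 0.01], [0.01, 0.1]], seed=0)
A = nx.to_numpy_array(g)
n = A.shape[0]

# Hold out 20% of node pairs (upper triangle) instead of splitting nodes.
rng = np.random.default_rng(0)
rows, cols = np.triu_indices(n, k=1)
holdout = rng.random(rows.size) < 0.2
A_obs = A.copy()
A_obs[rows[holdout], cols[holdout]] = 0
A_obs[cols[holdout], rows[holdout]] = 0

# Model selection: pick the rank with the best held-out AUC.
U, s, Vt = np.linalg.svd(A_obs, hermitian=True)
truth = A[rows[holdout], cols[holdout]]
for rank in (1, 2, 8):
    A_hat = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]
    scores = A_hat[rows[holdout], cols[holdout]]
    print(f"rank {rank}: held-out AUC = {roc_auc_score(truth, scores):.2f}")
```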

I believe in-depth benchmark studies that go beyond accuracy metrics are often the first steps towards understanding such issues.

Final Words

Overall, I think more effort should be spent on building a more comprehensive taxonomy of the diverse graph learning tasks. In this blog post, I am calling for greater appreciation of, and involvement in, benchmark studies from our community, including the creation of new and diverse datasets and tasks, and the careful examination and improvement of evaluation practices. As organizers, we hope the GLB workshop can contribute to the development of graph learning techniques by serving as a dedicated publication channel for benchmark datasets and by promoting benchmark studies in general.

Advertisement: We are organizing the 2nd GLB workshop in 2022, which will continue to be non-archival (research presented at the workshop can be published elsewhere, either before or after the workshop). Stay tuned for more information!

References

[1] C. Morris et al., “Weisfeiler and Leman go neural: Higher-order graph neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, vol. 33, pp. 4602–4609.
[2] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, “How powerful are graph neural networks?” in International Conference on Learning Representations, 2019.
[3] H. Maron, H. Ben-Hamu, H. Serviansky, and Y. Lipman, “Provably powerful graph networks,” in Advances in Neural Information Processing Systems, 2019, pp. 2156–2167.
[4] Q. Li, Z. Han, and X.-M. Wu, “Deeper insights into graph convolutional networks for semi-supervised learning,” in Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 2018, pp. 3538–3545.
[5] Y. Rong, W. Huang, T. Xu, and J. Huang, “DropEdge: Towards deep graph convolutional networks on node classification,” in International Conference on Learning Representations, 2020.
[6] L. Zhao and L. Akoglu, “PairNorm: Tackling oversmoothing in GNNs,” in International Conference on Learning Representations, 2020.
[7] W. Hu et al., “Open graph benchmark: Datasets for machine learning on graphs,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
[8] Z. Wu, S. Pan, F. Chen, G. Long, C. Zhang, and P. S. Yu, “A comprehensive survey on graph neural networks,” IEEE Transactions on Neural Networks and Learning Systems, 2020.
[9] Z. Zhang, P. Cui, and W. Zhu, “Deep learning on graphs: A survey,” IEEE Transactions on Knowledge and Data Engineering, 2020.
[10] K. Xu, J. Li, M. Zhang, S. S. Du, K. Kawarabayashi, and S. Jegelka, “What can neural networks reason about?” in International Conference on Learning Representations, 2020.
[11] Y. Sun and J. Han, “Mining heterogeneous information networks: Principles and methodologies,” Synthesis Lectures on Data Mining and Knowledge Discovery, vol. 3, no. 2, pp. 1–159, 2012.
[12] P. Sarkar, D. Chakrabarti, and M. I. Jordan, “Nonparametric link prediction in dynamic networks,” in Proceedings of the 29th International Conference on Machine Learning, 2012, pp. 1897–1904.
[13] J. Leskovec, D. Huttenlocher, and J. Kleinberg, “Signed networks in social media,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 2010, pp. 1361–1370.
[14] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather: Homophily in social networks,” Annual Review of Sociology, vol. 27, no. 1, pp. 415–444, 2001.
[15] A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
[16] P. Sen, G. Namata, M. Bilgic, L. Getoor, B. Galligher, and T. Eliassi-Rad, “Collective classification in network data,” AI Magazine, vol. 29, no. 3, p. 93, 2008.
[17] Z. Yang, W. Cohen, and R. Salakhutdinov, “Revisiting semi-supervised learning with graph embeddings,” in International Conference on Machine Learning, 2016, pp. 40–48.
[18] J. Zhu, Y. Yan, L. Zhao, M. Heimann, L. Akoglu, and D. Koutra, “Beyond homophily in graph neural networks: Current limitations and effective designs,” in Advances in Neural Information Processing Systems, vol. 33, 2020.
[19] M. Newman, Networks: An Introduction. Oxford University Press, 2010.
[20] K. Wang, Z. Shen, C. Huang, C.-H. Wu, Y. Dong, and A. Kanakia, “Microsoft Academic Graph: When experts are not enough,” Quantitative Science Studies, vol. 1, no. 1, pp. 396–413, 2020.
[21] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su, “ArnetMiner: Extraction and mining of academic social networks,” in Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008, pp. 990–998.
[22] D. Zügner, A. Akbarnejad, and S. Günnemann, “Adversarial attacks on neural networks for graph data,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018, pp. 2847–2856.
[23] J. Ma, S. Ding, and Q. Mei, “Towards more practical adversarial attacks on graph neural networks,” in Advances in Neural Information Processing Systems, 2020.
[24] J. Ma, J. Deng, and Q. Mei, “Adversarial attack on graph neural networks as an influence maximization problem,” in Proceedings of the 15th ACM International Conference on Web Search and Data Mining, 2022.
[25] J. Ma, J. Deng, and Q. Mei, “Subgroup generalization and fairness of graph neural networks,” in Advances in Neural Information Processing Systems, 2021.
[26] L. Zhao and L. Akoglu, “On using classification datasets to evaluate graph outlier detection: Peculiar observations and new insights,” arXiv preprint, 2020.
[27] S. Günnemann, P. Dao, M. Jamali, and M. Ester, “Assessing the significance of data mining results on graphs with feature vectors,” in 2012 IEEE 12th International Conference on Data Mining, 2012, pp. 270–279.
[28] T. Li, E. Levina, and J. Zhu, “Network cross-validation by edge sampling,” Biometrika, vol. 107, no. 2, pp. 257–276, 2020.
[29] A. Baranwal, K. Fountoulakis, and A. Jagannath, “Graph convolution for semi-supervised classification: Improved linear separability and out-of-distribution generalization,” in Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 684–693.