Course project of SJTU CS3319: Data Science Fundamentals, 2023 spring.
Attention: Discussion & referencing are welcome, but NO PLAGIARISM !!!
This is a link prediction problem on an academic network. We collected 6,611 authors and their 79,937 papers from top journals in the field of GeoScience, together with the citation information of these publications. The collected information is used to form an academic network; one feasible way to do so is:
Build a heterogeneous network that contains two types of nodes: one type represents authors and the other represents papers. In this network, each edge between an author node and a paper node means that the author has read the paper (authors are connected to the papers cited by the papers they wrote), each edge between two author nodes denotes co-authorship, and each directed edge between two paper nodes represents a citation relation.
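As an illustration of this construction, here is a minimal sketch using DGL's heterograph API. The library choice, relation names, and toy edge lists are assumptions made for the example, not necessarily what the repository uses.

```python
import torch
import dgl

# Hypothetical edge lists for illustration; in practice they come from the dataset files.
# Each (src, dst) tensor pair defines the edges of one relation.
author_reads_paper = (torch.tensor([0, 1]), torch.tensor([0, 1]))
author_coauthor = (torch.tensor([0]), torch.tensor([1]))
paper_cites_paper = (torch.tensor([0]), torch.tensor([1]))

# Two node types (author, paper) and three relations, as described above.
graph = dgl.heterograph({
    ('author', 'reads', 'paper'): author_reads_paper,
    ('author', 'coauthor', 'author'): author_coauthor,
    ('paper', 'cites', 'paper'): paper_cites_paper,
})
print(graph)  # prints node/edge counts per type
```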
This problem can be modelled as a link prediction problem, and your task is to predict the label of each author-paper pair in the test set based on the information provided. If the paper is recommended to the author, mark the pair as 1; otherwise mark it as 0.
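For concreteness, turning predicted link probabilities into the required 0/1 labels could look like the sketch below; the column names, file name, and the 0.5 threshold are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical test pairs and model scores; column names are assumptions for illustration.
test_pairs = pd.DataFrame({'author_id': [3, 7], 'paper_id': [120, 45]})
scores = np.array([0.83, 0.12])  # predicted probability that the paper should be recommended

# Label a pair 1 (recommend) when its score exceeds a threshold, else 0.
test_pairs['label'] = (scores > 0.5).astype(int)
test_pairs.to_csv('prediction.csv', index=False)
```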
For further information, see Kaggle competition here.
To address this task, we propose a Graph Attention Network (GAT) based graph learning model that follows the Relational Graph Convolutional Network (RGCN) aggregation approach. Our model achieves an F1-score of 0.9484 in the Kaggle competition, surpassing the majority of participating teams. This report describes the core architecture of our proposed method, and detailed implementation instructions are provided for reproducibility as a course project.
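To give a feel for the idea of attention-based message passing per relation, aggregated across relations in the RGCN style, here is a simplified sketch built from DGL's `HeteroGraphConv` and `GATConv`. The library, hidden sizes, and relation names are assumptions for the example; this is not the exact model implemented in the repository.

```python
import torch
import torch.nn as nn
import dgl.nn as dglnn


class RelationalGAT(nn.Module):
    """GAT message passing per relation, aggregated across relations RGCN-style."""

    def __init__(self, in_dim=64, hid_dim=64, num_heads=4):
        super().__init__()
        # One GATConv per relation; HeteroGraphConv sums the per-relation results,
        # mirroring the relation-wise aggregation of RGCN.
        self.conv = dglnn.HeteroGraphConv({
            'reads': dglnn.GATConv(in_dim, hid_dim, num_heads, allow_zero_in_degree=True),
            'coauthor': dglnn.GATConv(in_dim, hid_dim, num_heads, allow_zero_in_degree=True),
            'cites': dglnn.GATConv(in_dim, hid_dim, num_heads, allow_zero_in_degree=True),
        }, aggregate='sum')

    def forward(self, graph, feats):
        h = self.conv(graph, feats)  # dict: node type -> (N, num_heads, hid_dim)
        return {ntype: x.flatten(1) for ntype, x in h.items()}  # concatenate heads


# Usage sketch: feats is a dict of per-node-type feature matrices, e.g.
#   feats = {'author': torch.randn(n_authors, 64), 'paper': torch.randn(n_papers, 64)}
#   h = RelationalGAT()(graph, feats)
# An author-paper link can then be scored, e.g., by a dot product of the two embeddings.
```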
Several predictions with good performance submitted on Kaggle are put in `pretrained_predictions`. `prediction_Final_0.94841.csv` is the latest version submitted by our team.
Note that you may need to manually download the academic network dataset via this link, unzip it, and copy all files in `academic_network_data` to the empty directory `data`.
- Train the model with tuned parameters on GPU (default: `cuda:0`)

```
python train.py --lr 2e-4 --wd 4e-4 --num_epochs 80
```
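The flags above, together with `--batch_size` and `--cuda` mentioned in the notes below, suggest a command-line interface along the following lines. This is a hypothetical sketch; the defaults shown are assumptions and not necessarily those in `train.py`.

```python
import argparse

def parse_args():
    # Hypothetical parser mirroring the flags referenced in this README.
    parser = argparse.ArgumentParser(description='Train the GAT-based link predictor')
    parser.add_argument('--lr', type=float, default=2e-4, help='learning rate')
    parser.add_argument('--wd', type=float, default=4e-4, help='weight decay')
    parser.add_argument('--num_epochs', type=int, default=80, help='number of training epochs')
    parser.add_argument('--batch_size', type=int, default=1024, help='mini-batch size')
    parser.add_argument('--cuda', type=int, default=0, help='GPU index; -1 for CPU')
    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    device = 'cpu' if args.cuda < 0 else f'cuda:{args.cuda}'
    print(f'Training for {args.num_epochs} epochs on {device}')
```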
Note:
- 8 GB or more of GPU memory is preferred if you need to train the model yourself. Use `--batch_size` to adjust the batch size and `--cuda -1` for CPU training.
- It takes around 100-120 minutes to complete 80 training epochs on 2080Ti/3070Ti GPUs with default settings.
- Best F1 scores during training may vary slightly among experiments. However, an F1 score of $0.943\pm 0.003$ should be reached to indicate a reliable execution (see the sketch after these notes).
- Predictions will be saved in `new_prediction`.
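As a quick sanity check of the reported F1 range on any held-out labels you keep aside, scikit-learn's `f1_score` can be used as sketched below; the label arrays are placeholder values for illustration only.

```python
from sklearn.metrics import f1_score

# Hypothetical ground-truth labels and predictions for a held-out validation split.
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1]

print(f'F1 = {f1_score(y_true, y_pred):.4f}')  # harmonic mean of precision and recall
```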