
General methods for incorporating permuted networks #1

Open
zietzm opened this issue Jan 4, 2019 · 2 comments

@zietzm
Collaborator

zietzm commented Jan 4, 2019

I am trying to perform a simple example application of XSwap-permuted networks to a generic inference task. The simplest example that came to mind is to take a network (call the original network A), drop a fraction of its edges (resulting in reduced network B), and attempt to predict the dropped edges. So that the method does not depend solely on the DWPC metric (and the gamma-hurdle model fit to it), I'm using DWPC as well as two other metrics, the Jaccard coefficient and the number of common neighbors, for prediction.

Overall, my hypothesis is that degree-preserving random networks contain information that is beneficial to the edge prediction task. Specifically, the permuted networks could provide a way to isolate the confounders of node degree and network density.

It's not obvious exactly how random networks should be incorporated, though. I have a few general ideas, all of which have some downsides:

  1. Concatenate metrics based on B and average metrics based on permutations of B. Then compare performance of logistic regression on these features to performance on the B features alone. This is probably the easiest method, though it wouldn't be very interpretable and wouldn't capture any potential nonlinear relationships.

  2. Compute averages of metrics across permutations of B, then subtract averages from B metrics and predict edges. This makes some assumptions about the relationships between metrics and their averages, which may or may not be valid.

  3. Compare performance based on B metrics with performance on metrics of permuted B. This is basically the (delta AUROC) method used in Rephetio. Works well on the task for which it was designed, but I don't think it is appropriate here. We are only dealing with a single metaedge, and it's not clear what we could interpret from computing a delta AUROC here.

  4. (Not yet well-developed) Compute a prior edge probability or prior metric probability using permuted networks and update some conditional probability based either on B metrics themselves or logistic-regression derived probabilities.

  5. (Not yet well-developed) Somewhat like 4 but nonparametric. Rank potential connections using metrics computed on permuted networks. Somehow combine this information with B metrics (whether the metrics themselves or logistic-regression derived probabilities) to either get probabilities for each connection or at least rank potential connections. This would allow us to easily compute an AUROC value that could be compared to a simple no-permutation edge prediction method, though it's not clear how these data could be "combined."
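For concreteness, the degree-preserving permutation underlying all of these ideas can be sketched in a few lines. This is a minimal pure-Python version of the swap procedure; the actual XSwap implementation additionally handles directed and bipartite networks, excluded edges, and efficiency concerns.

```python
import random

def canon(u, v):
    """Canonical ordering for an undirected edge."""
    return (u, v) if u <= v else (v, u)

def xswap(edges, n_swaps=1000, seed=0):
    """Degree-preserving permutation of an undirected edge list:
    repeatedly pick two edges (a, b) and (c, d) and rewire them to
    (a, d) and (c, b), rejecting swaps that would create a self-loop
    or a duplicate edge. Every node keeps its exact degree."""
    rng = random.Random(seed)
    edges = [canon(*e) for e in edges]
    edge_set = set(edges)
    for _ in range(n_swaps):
        i = rng.randrange(len(edges))
        j = rng.randrange(len(edges))
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # swap would create a self-loop or is degenerate
        new_i, new_j = canon(a, d), canon(c, b)
        if new_i in edge_set or new_j in edge_set:
            continue  # swap would create a duplicate edge
        edge_set -= {edges[i], edges[j]}
        edge_set |= {new_i, new_j}
        edges[i], edges[j] = new_i, new_j
    return edges
```

Because each accepted swap replaces (a, b) and (c, d) with (a, d) and (c, b), every node appears in exactly as many edges after the swap as before, which is what makes the permuted networks usable as degree-matched nulls.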

@dhimmel
Collaborator

dhimmel commented Jan 5, 2019

I think we should start more simply and just assess the performance of various metrics on B to predict the holdout portion of A. Here are possible metrics we can assess:

  • the prior probability of edge existence (based on B-derived XSwapped networks)
  • random walk edge predictions on homogeneous networks
  • jaccard similarity on homogeneous networks and bipartite networks

Does this sound like a reasonable starting point? This way we can see on different networks how predictive the prior probability of edge existence is compared to other simple metrics that do require access to actual edges.

It would also be interesting to find actual predictions other people created (that we don't have to implement) and compare their performance to the prior edge probability.
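The first metric above, the prior probability of edge existence, reduces to counting how often each node pair is connected across the permuted networks. A minimal sketch, assuming the permuted networks are supplied as edge lists of canonical tuples:

```python
from collections import Counter

def edge_prior(permuted_networks):
    """Fraction of permuted networks in which each node pair is
    connected. `permuted_networks` is an iterable of edge lists,
    each edge a canonical (u, v) tuple."""
    counts = Counter()
    n_networks = 0
    for edges in permuted_networks:
        n_networks += 1
        counts.update(set(edges))  # de-duplicate within one network
    return {pair: count / n_networks for pair, count in counts.items()}
```

Pairs never connected in any permutation are simply absent from the result and can be treated as having prior 0.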

@zietzm
Collaborator Author

zietzm commented Jan 7, 2019

Do you include DWPC under "random walk", or do you mean something even simpler: just the nodes you reach in a certain number of steps, including loops? For the time being, I am going with DWPCs, but the other would be easy to compare.
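For reference, DWPC (degree-weighted path count) down-weights paths through high-degree nodes: each path contributes the product of its nodes' degrees raised to a negative damping exponent (0.4 in Rephetio). A minimal sketch for a homogeneous network, enumerating simple paths of a fixed length by recursion (fine for small examples, not for large networks):

```python
from collections import defaultdict

def dwpc(edges, source, target, length=2, damping=0.4):
    """Degree-weighted path count between source and target: the sum,
    over simple paths of the given length, of the product of
    deg(node) ** -damping across all nodes on the path."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {node: len(nbrs) for node, nbrs in adj.items()}

    def walk(path, remaining):
        node = path[-1]
        if remaining == 0:
            return [path] if node == target else []
        found = []
        for nbr in adj[node]:
            if nbr not in path:  # simple paths only: no repeated nodes
                found.extend(walk(path + [nbr], remaining - 1))
        return found

    total = 0.0
    for path in walk([source], length):
        weight = 1.0
        for node in path:
            weight *= degree[node] ** -damping
        total += weight
    return total
```

With `damping=0`, this reduces to a plain simple-path count, which is one way to realize the "even simpler" random-walk metric mentioned above.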

Using an example homogeneous network, I set up the following prediction task.

  1. Drop 20% of network edges
  2. Compute features using the pruned network
  3. Generate 1000 permutations of the pruned network
  4. Compute features using each permuted network and average each feature to get "priors" for each feature (not doing degree grouping for now)
  5. Construct train/test data:
    1. Sample 70% of the node pairs where an edge was pruned. Get an equal number of node pairs where edge does not exist in original or pruned -> Training data
    2. Remaining 30% of pruned edge node pairs and an equal number of non-existent edge node pairs -> Test data
  6. Scale data using min/max scaling and predict existence of edge in original network using logistic regression
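Steps 1 and 5 above (edge holdout and the balanced train/test split) can be sketched as follows. The helper name `holdout_split` is hypothetical; feature computation (steps 2-4) and the final model (e.g. scikit-learn's `MinMaxScaler` plus `LogisticRegression`) are omitted:

```python
import random
from itertools import combinations

def holdout_split(edges, nodes, drop_frac=0.2, train_frac=0.7, seed=0):
    """Drop a fraction of edges (step 1), then build balanced
    train/test sets of labeled node pairs (step 5). Positives are the
    dropped edges; negatives are sampled pairs with no edge in the
    original network (and therefore none in the pruned network)."""
    rng = random.Random(seed)
    edges = [tuple(sorted(e)) for e in edges]
    edge_set = set(edges)
    dropped = rng.sample(edges, int(drop_frac * len(edges)))
    pruned = [e for e in edges if e not in set(dropped)]
    non_edges = [p for p in combinations(sorted(nodes), 2)
                 if p not in edge_set]
    negatives = rng.sample(non_edges, len(dropped))
    n_train = int(train_frac * len(dropped))
    train = ([(p, 1) for p in dropped[:n_train]]
             + [(p, 0) for p in negatives[:n_train]])
    test = ([(p, 1) for p in dropped[n_train:]]
            + [(p, 0) for p in negatives[n_train:]])
    return pruned, train, test
```

Features for steps 2-4 would then be computed on `pruned` and on permutations of `pruned`, never on the original network, so that no information about the held-out edges leaks into the features.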

I computed the following features for each of the node pairs:

  • dwpc_2 - DWPC on two-step metapath PiPiP (protein-interacts-protein-interacts-protein)
  • dwpc_3 - DWPC on three-step metapath PiPiPiP
  • jaccard - Jaccard coefficient. For networks, this is computed using the neighboring node sets. Magnitude of intersection divided by magnitude of union.
  • cn - Common neighbors. The number of shared neighbors. In the context of the previous bullet, this is the magnitude of the neighbor set intersection
  • edge_prior - Fraction of permutations in which an edge exists between two nodes
  • dwpc_2_prior - Mean of dwpc_2 computed on each permuted network
  • dwpc_3_prior - Mean of dwpc_3 computed on each permuted network
  • jaccard_prior - Mean of jaccard computed on each permuted network
  • cn_prior - Mean of cn computed on each permuted network
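The Jaccard and common-neighbors features above are straightforward set operations on neighbor sets; a minimal sketch:

```python
from collections import defaultdict

def neighbor_sets(edges):
    """Map each node to its set of neighbors."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def common_neighbors(adj, u, v):
    """cn: |N(u) & N(v)|, the number of shared neighbors."""
    return len(adj[u] & adj[v])

def jaccard(adj, u, v):
    """jaccard: |N(u) & N(v)| / |N(u) | N(v)|."""
    union = adj[u] | adj[v]
    return len(adj[u] & adj[v]) / len(union) if union else 0.0
```

The prior variants of these features are just the same functions averaged over neighbor sets of each permuted network.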

Below are the results of this prediction task using various features

| Features | F1 | Average precision | AUROC |
| --- | --- | --- | --- |
| dwpc_2, dwpc_3 | 0.5540 | 0.7503 | 0.7469 |
| jaccard | 0.3719 | 0.6059 | 0.6136 |
| jaccard, cn | 0.3806 | 0.6123 | 0.6140 |
| jaccard, cn, dwpc_2, dwpc_3 | 0.5349 | 0.7486 | 0.7461 |
| edge_prior | 0.5834 | 0.7490 | 0.7216 |
| dwpc_2_prior, dwpc_3_prior | 0.6585 | 0.7808 | 0.7050 |
| jaccard_prior, cn_prior | 0.5613 | 0.7082 | 0.6684 |
| edge_prior, jaccard_prior, cn_prior, dwpc_2_prior, dwpc_3_prior | 0.6594 | 0.7839 | 0.7157 |
| jaccard, cn, dwpc_2, dwpc_3, edge_prior, jaccard_prior, cn_prior, dwpc_2_prior, dwpc_3_prior | 0.6965 | 0.8217 | 0.7473 |

I plotted some curves to look a little closer at the results. Some of these curves look pretty strange to me, and I'm not sure what would cause this behavior. Note that every task had a 50/50 split in true outcomes (i.e., 50% of node pairs had an edge that was removed during pruning).

  • edge_prior only (curves omitted)

  • dwpc_2, dwpc_3 (curves omitted)

  • jaccard, cn was actually less informative than jaccard_prior, cn_prior

    • jaccard and cn - These look especially strange. I'm having trouble understanding what underlying behavior would cause these curves to look this way. (curves omitted)

    • Priors (jaccard_prior, cn_prior) (curves omitted)

  • All features together (best performance) (curves omitted)
