Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimized SCC Implementation and Removed TensorFlow Dependencies #250

Merged
merged 13 commits into from
Oct 9, 2024

Conversation

Starlitnightly
Copy link
Contributor

When installing spateo on macOS, I encountered version incompatibility errors. For instance, the vtk package requires a minimum Python version of 3.9. Additionally, installing both torch and tensorflow simultaneously can lead to package conflicts. After thoroughly reviewing the implementation of the relevant TensorFlow code, I rewrote all TensorFlow-related code using PyTorch and removed all TensorFlow-related dependencies from requirements.txt.

Updates:

  • Re-Implemented NLPCA using PyTorch instead of TensorFlow.
  • Re-Implemented weighted_binary_crossentropy using PyTorch instead of TensorFlow.
  • Re-implemented calculate_leiden_partitionand added logger.info.
  • Optimized label assignment by majority voting in the neighborhood using optimize_cluster.

Notes:

After carefully comparing the clustering effects of Leiden in scanpy and spateo, I found that the enhanced effect is due to dynamo.tl.neighbors compared to scanpy.pp.neighbors. The exact reason for this enhancement still needs further investigation.

Additionally, it is important to note that scc does not achieve state-of-the-art (SOTA) performance and does not yield better results on the gold standard dataset of human cortical neurons.

Since scc is an adjacency matrix that directly combines spatial neighborhoods and PCA neighborhoods, mclust is not applicable and was not included in this PR. Perhaps I could introduce STAGATE into the spateo framework and use it in place of PCA. However, this would introduce a new dependency, pyg, which might complicate updating the existing requirements.txt and incur additional installation costs for users.

image image image

@Starlitnightly
Copy link
Contributor Author

More info about optimize_cluster

image

@Sichao25
Copy link
Contributor

Those updates are awesome; thanks for your contribution! I noticed you've specified matplotlib<=3.6.2 and pandas<=1.5.3 in your changes. Is there a specific reason for keeping these versions? We're currently updating these dependencies, and I also feel they may be causing some dependency issues in the CI.

@Starlitnightly
Copy link
Contributor Author

Thank you for your feedback! I’m glad you found the updates helpful.

Regarding the specific versions of matplotlib and pandas, there are a couple of reasons for those choices:

  1. Pandas: Version 1.5.3 is the last release before Pandas 2.0, which replaced the internal engine with Polars. This update introduced significant changes, including the renaming of many functions and alterations in formatting. Consequently, many packages have limited compatibility to Pandas 1.5.3 to avoid these disruptions.

  2. Matplotlib: For matplotlib, the version restriction to 3.6.2 or 3.6.3 is due to changes in the parameters of some plotting functions in higher versions, such as cmap. To maintain consistency and avoid potential issues, we opted to restrict it to 3.6.2. Without this restriction, we would need to implement version checks before calling plotting functions to ensure compatibility.

However, I understand the importance of keeping our dependencies up-to-date. If the current versions are causing CI issues, I’m more than willing to help investigate and test with the latest versions to ensure everything works smoothly.

@Starlitnightly
Copy link
Contributor Author

@Sichao25 But the latest version of anndata seems to have solved the support bug for pandas versions greater than 2. This requires more testing to ensure compatibility

scverse/anndata#948

@Sichao25
Copy link
Contributor

Sichao25 commented Jul 25, 2024

Thanks for sharing your concerns. We are updating important dependencies like NumPy, pandas, and matplotlib to ensure compatibility with other tools. The current plan includes support for pandas >=1.5.3 and matplotlib>=3.6.2. These updates will make it easier for you to pass the CI and test your changes. Feel free to join the discussion here to share your ideas or report any dependency issues.

About scc, it will be nice to have STAGATE introduced as a new option. How about setting it as an optional dependency that users can install themselves? Specifically, we won't add new dependency directly to requirements.txt, but we will check for its installation before using it by some code like:

try:
    import pyg
except ImportError:

Copy link
Contributor

@Xiaojieqiu Xiaojieqiu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Starlitnightly thanks for this pull request. Your ideas on switching from tensorflow to pytorch is a great one. I also like your optimize_cluster function which can definitely be used to smooth the cluster layer on the space


# The itermediate model gets the output of the bottleneck layer,
# which acts as the projection layer.
self.intermediate_layer_model = Model(inputs=model.input, outputs=model.get_layer(bname).output)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Starlitnightly it looks like your updated code doesn't set the intermediate_layer_model?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so it is always None?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented intermediate_layer_model inside the AutoEncoder later on, which should really be removed in self, I need to test it a bit further

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have tested this function on simulated data, and it runs well. However, I haven't found specific use cases for this function. Could someone provide specific scenarios where this function is used so that I can further optimize it?

spateo/tools/cluster/find_clusters.py Outdated Show resolved Hide resolved
spateo/tools/cluster/leiden.py Show resolved Hide resolved
@Xiaojieqiu
Copy link
Contributor

@Starlitnightly regarding your comments on scc, we found scc works well on several dataset given its simple formulation. (1) but use STAGATE to replace pca seem like a good idea. (2) are you saying spateo's leiden results are worse than scanpy?

@Starlitnightly
Copy link
Contributor Author

@Starlitnightly regarding your comments on scc, we found scc works well on several dataset given its simple formulation. (1) but use STAGATE to replace pca seem like a good idea. (2) are you saying spateo's leiden results are worse than scanpy?

I compared the implementation of leiden in spateo and scanpy, in fact both are the same, so the result is the same, it is the neighbors that cause the difference in leiden. spateo's results are better compared to scanpy's

@Starlitnightly
Copy link
Contributor Author

Thanks for sharing your concerns. We are updating important dependencies like NumPy, pandas, and matplotlib to ensure compatibility with other tools. The current plan includes support for pandas >=1.5.3 and matplotlib>=3.6.2. These updates will make it easier for you to pass the CI and test your changes. Feel free to join the discussion here to share your ideas or report any dependency issues.

About scc, it will be nice to have STAGATE introduced as a new option. How about setting it as an optional dependency that users can install themselves? Specifically, we won't add new dependency directly to requirements.txt, but we will check for its installation before using it by some code like:

try:
    import pyg
except ImportError:

I'm going to try to implement STAGATE in spateo in a future PR. Since the author doesn't give a usable package directly in pypi, a lot of files and code may need to be added!

@Starlitnightly
Copy link
Contributor Author

Starlitnightly commented Oct 1, 2024

Maybe it need to be added more method for cluster SOTA in this commit.

@Starlitnightly
Copy link
Contributor Author

The latest tutorial can be viewed in the pull request of spateo-tutorial: https://github.com/aristoteleo/spateo-tutorials/blob/bf32d5739f380948e76cbe07c99e3e8d6d1e3627/5_cluster_digitization/1_bin_scc.ipynb

It should be noted that CAST and STAGATE do not perform well on the current dataset, but they outperform SCC on other datasets. Additionally, in the new tutorial, I have modified the method of spatial domain annotation to use dictionary-based annotation instead of sequential annotation.

There are a few packages providing commands. Try e.g. `pip install scanpy-
scripts`!

positional arguments:
  {settings}

options:
  -h, --help  show this help message and exit in pySTAGATE
@Xiaojieqiu
Copy link
Contributor

excellent work. I am going to merge the pull request now.

@Xiaojieqiu Xiaojieqiu merged commit 3c01986 into aristoteleo:main Oct 9, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants