Stellar running forever #13

Open
LukasHats opened this issue Oct 4, 2024 · 0 comments

Thanks for providing STELLAR.
I am currently trying to run STELLAR on the Hubmap demo dataset on our cluster. Although the demo is supposed to finish quite fast, the job has been running for more than 24 hours. I can see that the GPU is being used, but only around 2.5 GB of its memory (2371 MiB), and GPU utilization shows 0%. I am not sure what is wrong. The loss is also being printed.

My environment:

 Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
anndata                   0.7.6                    pypi_0    pypi
blas                      1.0                         mkl  
blosc2                    2.0.0                    pypi_0    pypi
bottleneck                1.3.7            py38ha9d4c09_0  
brotli-python             1.0.9            py38h6a678d5_8  
ca-certificates           2024.9.24            h06a4308_0  
certifi                   2024.8.30        py38h06a4308_0  
charset-normalizer        3.3.2              pyhd3eb1b0_0  
contourpy                 1.1.1                    pypi_0    pypi
cudatoolkit               11.3.1               h2bc3f7f_2  
cycler                    0.12.1                   pypi_0    pypi
cython                    3.0.11                   pypi_0    pypi
fonttools                 4.54.1                   pypi_0    pypi
h5py                      3.11.0                   pypi_0    pypi
idna                      3.7              py38h06a4308_0  
igraph                    0.9.10                   pypi_0    pypi
imageio                   2.35.1                   pypi_0    pypi
importlib-metadata        8.5.0                    pypi_0    pypi
importlib-resources       6.4.5                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46306  
jinja2                    3.1.4            py38h06a4308_0  
joblib                    1.4.2            py38h06a4308_0  
kiwisolver                1.4.7                    pypi_0    pypi
ld_impl_linux-64          2.40                 h12ee557_0  
legacy-api-wrap           1.4                      pypi_0    pypi
libffi                    3.4.4                h6a678d5_1  
libgcc-ng                 11.2.0               h1234567_1  
libgfortran-ng            11.2.0               h00389a5_1  
libgfortran5              11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libstdcxx-ng              11.2.0               h1234567_1  
libuv                     1.48.0               h5eee18b_0  
llvmlite                  0.41.1                   pypi_0    pypi
louvain                   0.7.1                    pypi_0    pypi
markupsafe                2.1.3            py38h5eee18b_0  
matplotlib                3.6.3                    pypi_0    pypi
mkl                       2023.1.0         h213fc3f_46344  
mkl-service               2.4.0            py38h5eee18b_1  
mkl_fft                   1.3.8            py38h5eee18b_0  
mkl_random                1.2.4            py38hdb19cb5_0  
msgpack                   1.1.0                    pypi_0    pypi
natsort                   8.4.0                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
networkx                  3.1              py38h06a4308_0  
ninja                     1.10.2               h06a4308_5  
ninja-base                1.10.2               hd09550d_5  
numba                     0.58.1                   pypi_0    pypi
numexpr                   2.8.4            py38hc78ab66_1  
numpy                     1.22.4                   pypi_0    pypi
openssl                   3.0.15               h5eee18b_0  
packaging                 24.1             py38h06a4308_0  
pandas                    1.3.0                    pypi_0    pypi
patsy                     0.5.6                    pypi_0    pypi
pillow                    10.4.0                   pypi_0    pypi
pip                       24.2             py38h06a4308_0  
platformdirs              3.10.0           py38h06a4308_0  
pooch                     1.7.0            py38h06a4308_0  
py-cpuinfo                9.0.0                    pypi_0    pypi
pyg                       2.0.4           py38_torch_1.10.0_cu113    pyg
pynndescent               0.5.13                   pypi_0    pypi
pyparsing                 3.1.2            py38h06a4308_0  
pysocks                   1.7.1            py38h06a4308_0  
python                    3.8.20               he870216_0  
python-dateutil           2.9.0post0       py38h06a4308_2  
python-louvain            0.1                      pypi_0    pypi
python-tzdata             2023.3             pyhd3eb1b0_0  
pytorch                   1.10.2          py3.8_cuda11.3_cudnn8.2.0_0    pytorch
pytorch-cluster           1.6.0           py38_torch_1.10.0_cu113    pyg
pytorch-mutex             1.0                        cuda    pytorch
pytorch-scatter           2.0.9           py38_torch_1.10.0_cu113    pyg
pytorch-sparse            0.6.13          py38_torch_1.10.0_cu113    pyg
pytorch-spline-conv       1.2.1           py38_torch_1.10.0_cu113    pyg
pytz                      2024.1           py38h06a4308_0  
pywavelets                1.4.1                    pypi_0    pypi
pyyaml                    6.0.1            py38h5eee18b_0  
readline                  8.2                  h5eee18b_0  
requests                  2.32.3           py38h06a4308_0  
scanpy                    1.8.0                    pypi_0    pypi
scikit-image              0.18.0                   pypi_0    pypi
scikit-learn              1.0.2                    pypi_0    pypi
scipy                     1.7.0                    pypi_0    pypi
seaborn                   0.13.2                   pypi_0    pypi
setuptools                75.1.0           py38h06a4308_0  
sinfo                     0.3.4                    pypi_0    pypi
six                       1.16.0             pyhd3eb1b0_1  
sqlite                    3.45.3               h5eee18b_0  
statsmodels               0.14.1                   pypi_0    pypi
stdlib-list               0.10.0                   pypi_0    pypi
tables                    3.8.0                    pypi_0    pypi
tbb                       2021.8.0             hdb19cb5_0  
texttable                 1.7.0                    pypi_0    pypi
threadpoolctl             3.5.0            py38h2f386ee_0  
tifffile                  2023.7.10                pypi_0    pypi
tk                        8.6.14               h39e8969_0  
tqdm                      4.66.5           py38h2f386ee_0  
typing_extensions         4.11.0           py38h06a4308_0  
umap-learn                0.5.6                    pypi_0    pypi
urllib3                   2.2.3            py38h06a4308_0  
wheel                     0.44.0           py38h06a4308_0  
xlrd                      1.2.0                    pypi_0    pypi
xz                        5.4.6                h5eee18b_1  
yacs                      0.1.6              pyhd3eb1b0_1  
yaml                      0.2.5                h7b6447c_0  
zipp                      3.20.2                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_1  

My Slurm file:

#!/bin/sh
#SBATCH --job-name="STELLAR_demo_2_241002"
#SBATCH --partition=gpu-single
#SBATCH --ntasks=1
#SBATCH --ntasks-per-node=16
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
#SBATCH --mem=350gb

module load devel/cuda
module load devel/miniconda/3
source $MINICONDA_HOME/etc/profile.d/conda.sh
conda activate stellar

cd /gpfs/bwfor/work/ws/hd_bm327-phenotyping_benchmark/stellar/


conda run -n stellar python STELLAR_run.py --dataset Hubmap --num-heads 23

This is the GPU usage:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  |   00000000:8A:00.0 Off |                    0 |
| N/A   31C    P0             71W /  400W |    2371MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A   3981110      C   python                                       2362MiB |
+-----------------------------------------------------------------------------------------+
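
For anyone who wants to track these numbers over time, here is a minimal sketch that reads memory use and utilization out of the status row of the table above. `parse_smi_row` and `ROW_PATTERN` are hypothetical names made up for this illustration, not part of STELLAR or of any NVIDIA tooling, and the pattern assumes the row layout shown in this particular `nvidia-smi` output.

```python
import re

# Hypothetical helper: matches the "used MiB / total MiB | util%" part of
# one nvidia-smi status row, as laid out in the table pasted above.
ROW_PATTERN = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB\s*\|\s*(\d+)%")


def parse_smi_row(row):
    """Return (used_mib, total_mib, util_percent), or None if the row does not match."""
    match = ROW_PATTERN.search(row)
    if match is None:
        return None
    used, total, util = (int(group) for group in match.groups())
    return used, total, util


# The status row copied from the report above.
row = "| N/A   31C    P0             71W /  400W |    2371MiB /  81920MiB |      0%      Default |"

used, total, util = parse_smi_row(row)
print(f"{used} MiB of {total} MiB in use ({used / total:.1%}), GPU utilization {util}%")
```

Applied to the row above, this gives 2371 MiB used out of 81920 MiB and 0% utilization, so the process holds GPU memory but the GPU was idle at the moment of the snapshot.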

I have not changed any of the scripts. Does anyone have any suggestions?
