
Issue with SliRec and DKN OOM #2076

Closed
miguelgfierro wants to merge 12 commits into staging from miguel/nightly_oom

Conversation

miguelgfierro (Collaborator) commented on Mar 25, 2024

Description

Fix #2063

I ran both tests on a local V100 and they pass. In the nightly tests I get: https://github.com/recommenders-team/recommenders/actions/workflows/azureml-gpu-nightly.yml?query=branch%3Amiguel%2Fnightly_oom

The DKN test already runs with 5 epochs and a batch size of 200. I don't know whether the two tests conflict because TF takes the whole GPU memory by default.

This shouldn't happen, because in the BaseModel class we set gpu_options = tf.compat.v1.GPUOptions(allow_growth=True). However, when I run the code locally, I still see TF taking all the memory.
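
For context, a minimal sketch (not the repository's exact code) of the two usual ways to enable TensorFlow memory growth; the BaseModel line quoted above corresponds to the TF1-compat path. Note that neither setting returns memory the process has already allocated:

    import tensorflow as tf

    USE_TF1_COMPAT_SESSION = True  # pick one of the two approaches below

    if USE_TF1_COMPAT_SESSION:
        # TF1-compat path, matching the GPUOptions line quoted above: let TF
        # grow its GPU allocation on demand instead of reserving the device.
        gpu_options = tf.compat.v1.GPUOptions(allow_growth=True)
        sess = tf.compat.v1.Session(
            config=tf.compat.v1.ConfigProto(gpu_options=gpu_options)
        )
    else:
        # TF2 equivalent; must run before any GPU has been initialized.
        for gpu in tf.config.list_physical_devices("GPU"):
            tf.config.experimental.set_memory_growth(gpu, True)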

Related Issues

References

Checklist:

  • I have followed the contribution guidelines and code style for this project.
  • I have added tests covering my contributions.
  • I have updated the documentation accordingly.
  • I have signed the commits, e.g. git commit -s -m "your commit message".
  • This PR is being made to staging branch AND NOT TO main branch.

Signed-off-by: miguelgfierro <[email protected]>
miguelgfierro (Collaborator, Author) commented on Mar 25, 2024

If I comment out all tests except SLiRec, it works and it is quick:

INFO:submit_groupwise_azureml_pytest.py:Executing tests now...
============================= test session starts ==============================
platform linux -- Python 3.10.13, pytest-8.1.1, pluggy-1.4.0
rootdir: /mnt/azureml/cr/j/a2150f1522ab4d47941d5730dcfc5eb8/exe/wd
configfile: pyproject.toml
plugins: mock-3.14.0, hypothesis-6.99.13, typeguard-4.2.1, cov-5.0.0, anyio-4.3.0
collected 2 items

tests/unit/examples/test_notebooks_gpu.py .                              [ 50%]
tests/functional/examples/test_notebooks_gpu.py 

============================== slowest durations ===============================
343.83s call     tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional[recommenders/models/deeprec/config/sli_rec.yaml-tests/resources/deeprec/slirec-5-300-expected_values0-42]
1.36s call     tests/unit/examples/test_notebooks_gpu.py::test_gpu_vm

(4 durations < 0.005s hidden.  Use -vv to show these durations.)
================== 2 passed, 12 warnings in 348.30s (0:05:48) ==================
INFO:submit_groupwise_azureml_pytest.py:Test execution completed!
Cleaning up all outstanding Run operations, waiting 300.0 seconds
1 items cleaning up...
Cleanup took 5.421173095703125 seconds

See https://github.com/recommenders-team/recommenders/actions/runs/8422556385/job/23062115526

It works with:

        "tests/functional/examples/test_notebooks_gpu.py::test_dkn_quickstart_functional", 
        "tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional",

https://github.com/recommenders-team/recommenders/actions/runs/8424096579

If I run lightgcn, SLiRec, and DKN together (4a484e8), I get an OOM: https://github.com/recommenders-team/recommenders/actions/runs/8427004517/job/23076600230#step:3:8023

miguelgfierro (Collaborator, Author) commented:

@SimonYansenZhao feel free to continue in this branch #2076

The problem I'm seeing is that, for some reason, each test keeps its GPU memory, so the next test gets an OOM. I tried tf.keras.backend.clear_session() but it didn't work. Also, in the BaseModel class we have gpu_options = tf.compat.v1.GPUOptions(allow_growth=True); however, when I run the code locally, I see TF taking all the memory.
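
One possible workaround, sketched here only as an assumption and not what this PR implements, is to isolate each GPU test in its own process so the CUDA context (and the memory TF holds) is released when that process exits. The test IDs are the ones listed above:

    import subprocess
    import sys

    # Run each GPU notebook test in a separate pytest process so TensorFlow's
    # CUDA context, and the GPU memory it holds, is freed when the process
    # exits instead of accumulating across tests in a single pytest run.
    GPU_TESTS = [
        "tests/functional/examples/test_notebooks_gpu.py::test_dkn_quickstart_functional",
        "tests/functional/examples/test_notebooks_gpu.py::test_slirec_quickstart_functional",
    ]

    exit_code = 0
    for test_id in GPU_TESTS:
        result = subprocess.run([sys.executable, "-m", "pytest", test_id])
        exit_code = exit_code or result.returncode

    sys.exit(exit_code)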

SimonYansenZhao (Collaborator) commented:

> @SimonYansenZhao feel free to continue in this branch #2076
>
> The problem I'm seeing is that, for some reason, each test keeps its GPU memory, so the next test gets an OOM. I tried tf.keras.backend.clear_session() but it didn't work. Also, in the BaseModel class we have gpu_options = tf.compat.v1.GPUOptions(allow_growth=True); however, when I run the code locally, I see TF taking all the memory.

Yeah, I ran each test individually today and all passed. I'll try running the three tests together.

SimonYansenZhao (Collaborator) commented:

@miguelgfierro all tests, including the nightly builds, passed after #2077 was merged. I'm not sure what's going on with the OOM issue, but I think we can merge staging into main now.

miguelgfierro (Collaborator, Author) commented:

@SimonYansenZhao I'm freaking out.

miguelgfierro (Collaborator, Author) commented:

Let me run the nightly builds again.

miguelgfierro deleted the miguel/nightly_oom branch on April 5, 2024, at 08:02.